gnmatcher's Introduction

GNames

The goal of the GNames project is to provide accurate and fast verification of scientific names in unlimited quantities. Verification should run at a rate of at least 1000 names per second and include exact and fuzzy matching of input strings to scientific names aggregated from a large number of data-sources.

If you do not need exact records of matched names from data-sources and just want to know whether a name-string is known, you can use GNmatcher instead of this project. GNmatcher is significantly faster and has simpler output.

Features

  • Fast verification of an unlimited number of scientific names.
  • Multiple levels of verification:
    • Exact matching (exact string match for viruses, exact canonical form match for Plantae, Fungi, Bacteria, and Animalia).
    • Fuzzy matching detects human and/or Optical Character Recognition (OCR) errors without producing a large number of false positives. To avoid false positives, uninomial names are only checked for an exact match.
    • PartialExact matching happens when a match for the full name-string is not found. In such cases middle or end words are removed and each variant is verified. Matches of names with the last word intact are preferred.
    • PartialFuzzy matching is provided for partial matches of species and infraspecies. To avoid false positives, uninomials are only checked for an exact match.
    • Virus matching provides verification of virus names.
    • FacetedSearch allows searching with a flexible query language.
  • Providing names information from data-sources that contain a particular name.
    • Returning the "best" result. The BestResult is calculated by a scoring algorithm.
    • Optionally, limiting results to data-sources that are important to a GNames user.
  • Providing outlink URLs to some data-sources websites to show the original record of a name.
  • Providing meta-information about aggregated data-sources.
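
The PartialExact step described above can be illustrated with a small sketch. It is one possible reading of "middle or end words are removed", written in Python for brevity, and is not the code GNames itself uses. The function name `partial_variants` and the exact ordering of candidates are assumptions.

```python
def partial_variants(canonical):
    """Return shortened variants of a canonical name, most preferred first.

    Hypothetical illustration: variants that keep the last word come first,
    then variants with end words removed, and finally the genus alone.
    """
    words = canonical.split()
    if len(words) < 2:
        return []  # a uninomial has nothing to remove
    genus, rest = words[0], words[1:]
    variants = []
    # Remove one middle word at a time, keeping the last word intact.
    for i in range(len(rest) - 1):
        kept = rest[:i] + rest[i + 1:]
        variants.append(" ".join([genus] + kept))
    # Remove words from the end, one more at each step.
    for end in range(len(rest) - 1, 0, -1):
        variants.append(" ".join([genus] + rest[:end]))
    # The genus alone is the last resort.
    variants.append(genus)
    return variants
```

For a placeholder trinomial "Aus bus cus" this yields "Aus cus", then "Aus bus", then "Aus", so a candidate with the last word intact is tried before the others.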

Installation

Most users do not need to install GNames and can use the remote GNames API service at http://verifier.globalnames.org/api/v1 or the command line client GNverifier. Nevertheless, it is possible to install a local copy of the service.

Installation prerequisites

  • A Linux-based operating system.
  • At least 32GB of memory.
  • At least 50GB of free disk space.
  • A fast Internet connection during installation. After installation GNames can operate without a remote connection.
  • PostgreSQL database.

Installation process

  1. PostgreSQL

    We do not cover the basics of PostgreSQL administration here. There are many tutorials and resources for Linux-based operating systems that can help.

    Create a database named gnames. Download the gnames database dump. Restore the database with:

    gunzip -c gnames_latest.tar.gz |pg_restore -d gnames
  2. GNmatcher

    Refer to the GNmatcher documentation for its installation.

  3. GNames

    Download the latest release of GNames, unpack it and place it somewhere in the PATH.

    Run gnames -V. It will show you the version of GNames and also generate the $HOME/.config/gnames.yaml configuration file.

    Edit $HOME/.config/gnames.yaml according to your preferences.

    Try it by running

    gnames rest -p 8888

    To start the service automatically you can create a systemd unit for it, if your system supports systemctl.

    Alternatively you can use the Docker image to run GNames. You will need to create a file with the corresponding environment variables, which are described in the .env.example file.

    docker pull gnames/gnames:latest
    docker run --env-file path_to_env_file -d -i -t -p 8888:8888 \
      gnames/gnames:latest rest -p 8888

    We provide an example of an environment file. Environment variables override configuration file settings.

Configuration

Configuration settings can either be given in the config file located at $HOME/.config/gnames.yaml, or by setting the following environment variables:

Env. Var.          Configuration
GN_CACHE_DIR       CacheDir
GN_JOBS_NUM        JobsNum
GN_MATCHER_URL     MatcherURL
GN_MAX_EDIT_DIST   MaxEditDist
GN_PG_DB           PgDB
GN_PG_HOST         PgHost
GN_PG_PASS         PgPass
GN_PG_PORT         PgPort
GN_PG_USER         PgUser
GN_PORT            Port

The meaning of each configuration setting is described in the default gnames.yaml.
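
The precedence rule (environment variables override the config file) can be pictured with the following minimal sketch. It is written in Python for illustration and is not how GNames itself loads its configuration. The mapping uses a few names from the table above, and `effective_config` is a hypothetical helper.

```python
import os

# A subset of the table above: environment variable -> configuration key.
ENV_TO_KEY = {
    "GN_PG_HOST": "PgHost",
    "GN_PG_PORT": "PgPort",
    "GN_PG_USER": "PgUser",
    "GN_PORT": "Port",
}


def effective_config(file_settings, environ=None):
    """Merge config-file settings with environment variables.

    Values found in the environment win over values from the file.
    Environment values are kept as strings in this sketch.
    """
    if environ is None:
        environ = os.environ
    merged = dict(file_settings)
    for env_name, key in ENV_TO_KEY.items():
        if env_name in environ:
            merged[key] = environ[env_name]
    return merged
```

With `{"PgHost": "localhost"}` in the file and `GN_PG_HOST=db.example.org` in the environment, the effective PgHost would be `db.example.org`.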

Usage as API

Please note that the API currently under development (documentation) is publicly served at https://verifier.globalnames.org/api/v1.

If you installed GNames locally and want to run its API, run:

gnames rest
# to change from default 8888 port
gnames rest -p 8787

Refer to GNames' RESTful API Documentation for details on interacting with the GNames API.
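
As a rough illustration, a locally running service could be queried from Python with the standard library only. This is a sketch under stated assumptions: the `/verifications` path, the `nameStrings` request field, and the local base URL `http://localhost:8888/api/v1` are not given in this README and should be checked against the API documentation before use.

```python
import json
import urllib.request

# Assumed base URL of a local instance started with `gnames rest`.
BASE_URL = "http://localhost:8888/api/v1"


def build_verification_request(names, base_url=BASE_URL):
    """Build a POST request carrying the name-strings as JSON.

    The endpoint path and the "nameStrings" field are assumptions.
    """
    payload = json.dumps({"nameStrings": list(names)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/verifications",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def verify(names, base_url=BASE_URL):
    """Send the request and return the decoded JSON response."""
    request = build_verification_request(names, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `verify(["Abrostola triplasia"])` would only work while the service is running and if the assumed endpoint matches the deployed API version.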

Usage with GNverifier

GNverifier is a command line client for the GNames backend. It uses the publicly available remote API of GNames. Install and use it according to the GNverifier documentation.

GNverifier also provides a web-based user interface to GNames. To launch it, use something like:

gnverifier -p 8777

Known limitations of the verification

  • Exact matches of misspellings that might exist in poorly curated databases prevent finding fuzzy matches from better curated sources.

    To increase performance we stop any further attempts once a name is
    matched successfully. This prevents fuzzy matching if a misspelled name is
    found somewhere. It helps to check the 'curation' field of the returned
    result and see how many data-sources contain the name.
    
  • Fuzzy matching of a name where the genus string is broken by a space.

    For example, we cannot match 'Abro stola triplasia' to 'Abrostola triplasia'. The edit distance between the strings is only 1; however, we stem specific epithets, so in reality we fuzzy-match 'Abro stol triplas' to 'Abrostola triplas'. The edit distance is then 2, which is usually beyond our threshold.
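
The two distances in this example can be checked with a plain Levenshtein implementation. The sketch below is in Python and is independent of the fuzzy-matching code GNames actually uses. The stemmed strings are copied from the text above rather than produced by a stemmer.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Unstemmed strings: only the stray space has to be removed.
print(levenshtein("Abro stola triplasia", "Abrostola triplasia"))  # 1
# Stemmed strings: the space is removed and the dropped 'a' is restored.
print(levenshtein("Abro stol triplas", "Abrostola triplas"))  # 2
```

The second distance grows because 'stola' is stemmed as if it were a specific epithet in the broken name, while it remains part of the unstemmed genus in the correct one.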

Development

  • Install the Go language for your Linux operating system.
  • Create a PostgreSQL database as described in the installation section.
  • Clone the GNames code.
  • Clone GNmatcher and set it up for development.
  • Install Docker and Docker Compose.
  • Go to your local gnames directory:
    • Run make dc
    • Run docker-compose up
    • In another terminal window run go test ./...

Authors

License

The GNames code is released under the MIT license.

gnmatcher's Issues

Add exact string match for viruses

We do not parse virus names, so the best possible solution at the moment is to match virus name strings exactly. The term "virus" is not used here in a strict scientific sense: besides viruses we include phages, prions, plasmids, etc.

can I use gnmatcher to match two local lists?

I am just wondering if gnmatcher is the right tool to use (offline) for this purpose:

I have two lists of scientific names, and I want to produce a 3rd list of matches.

  • List1 is just "correct" (i.e., an already curated national checklist with 50000 names).
  • List2 contains about 4000 names expected to be in List1, but they could contain misspellings.
    I just want to compare names in List2 against those in List1, and receive this back:
  • List3 of 4000 items (either correct names as they matched from List1, or "couldn't match" flags).
    I suppose the program has to be fed with List1, List2 and also with a numeric parameter (minimum percentage of similarity required for a positive match, or something like that).

Example:

master list (L1):
Homo sapiens L.
Felis catus L.
Canis familiaris L.
Canis lupus L.
Sus domesticus Erxleben
Sus scrofa L.
Ovis aries L.
Bos taurus L.
Panthera leo
Gallus gallus L.
Equus asinus L.
Apis mellifera L.
Anas platyrhynchos L.
names to match (L2) ==> matcher(L1, L2, min=0.9) ==> matches (L3)
Bos taurus Linn. ==> Bos taurus L.
Gallus asinus ==> couldn't match
Momo sapiens ==> Homo sapiens L.
Sus domesticus L. ==> couldn't match

I would like to use Python system calls (Windows OS) to pass the lists and other parameters and receive back results.

Is gnmatcher (or any other gnames software) a good choice for doing this?
If not, could you recommend other options?

The main thing here is that I want to reconcile a list of (possibly misspelled) names against my own curated list (not against any web services).

I suppose this is not difficult to implement in pure Python using some kind of fuzzy string matching library.
But I am already calling other gnames tools from Python (i.e. gnparser is returning results really fast).
So I thought it could be possible to find a tool which I can call for doing this task the same way.
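
For reference, the pure-Python approach mentioned above could look like the following sketch, which uses only the standard library `difflib` module. It makes two simplifying assumptions: the canonical form is taken to be the first two words of a name (a real parser such as gnparser would do this properly), and authorship is ignored when comparing. The names `canonical` and `match_lists` are hypothetical.

```python
from difflib import get_close_matches


def canonical(name):
    """Crude canonical form: the first two words (assumes binomials)."""
    return " ".join(name.split()[:2])


def match_lists(master, queries, min_ratio=0.9):
    """Map every query name to the best master name, or None if no match.

    Similarity is difflib's ratio between canonical forms; authorship
    strings are not compared in this sketch.
    """
    by_canonical = {canonical(name): name for name in master}
    results = {}
    for query in queries:
        hits = get_close_matches(
            canonical(query), list(by_canonical), n=1, cutoff=min_ratio
        )
        results[query] = by_canonical[hits[0]] if hits else None
    return results
```

With the master list from the example, "Momo sapiens" maps to "Homo sapiens L." and "Gallus asinus" finds no match at `min_ratio=0.9`. Because authorship is ignored, "Sus domesticus L." would match "Sus domesticus Erxleben" here, unlike the expected output in the example.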

do not look for full canonical

I think we do want to show matches of a full canonical that either do not have rank, or have a wrong rank. They should never appear in best results, but they do have to appear in the rest of results.

option to match by species epithet in case of NoMatch for infraspecific autonyms (also in gndiff)

This is not a bug but a couple of feature requests. I guess it may be a common need for other people.
I am finally beginning to use gnames for resolving long lists of names, and I have a problem with nominotypical infraspecies (a little more here).
Consider these two plant names:

As you can see, there are only a couple of datasources matches for the last one (BTW see gnames/gnames#92), whereas 15 matches for the first one.
This is because most datasources do not currently accept infraspecific taxa subordinated to the species. So they resolve them as synonyms of some other species.
That's fine, but in many cases the datasource authors didn't consider it necessary to include one more record entry to also resolve the typical infraspecies so it points to its parent species.
So, we get no matches when we look for these "typical infraspecies" in many trustworthy datasources, even when the species name exists in most of them. Of course a human brain would immediately decide to take the parent specific rank name as a good match. But gnames of course does not (as expected).

That's a first problem (for my use case) because users are forced to make a second run of all these names against gnames after discarding their infraspecific epithets, in order to get a match in preferred datasources.
We can't do this in the first run (discarding all autonyms in our original list), because then we would miss all those cases where infraspecific autonyms are actually present in data sources.

A second problem is that some datasources which do show those autonym names do not provide the author's name (this is correct), whereas some others do provide the species' author (which IMHO is a good decision). That causes some inconsistencies to take into account when producing a checklist, because you would like all names displayed the same way (in my case, with an authorship provided by the preferred datasource, even if this is provided at the species level).

So, I would request this both for gnames api and also for future gndiff lists matching:

  1. OPTION: if an infraspecific autonym returns NoMatch, then go for just matching against the specific name (datasource species epithet = requested species epithet = requested infraspecies epithet) and return that instead. Please add an informative flag in the output for the cases where this happened.
  2. OPTION: If an infraspecific autonym has no authorship, then return the species author (i.e. the author which would be returned by option 1 above). This should also be flagged when it happens.
    And make it happen whether the match is "real" (infraspecific autonym present in datasource, but without an authorship) or "forced" by the use of option 1. So no name is returned without authorship if possible.

As for how to name these options and the output flags, I have no preferences. Just in case you need ideas:

  • Request option: Infraspecific autonyms match species: IAMS=True (default=False)
  • Request option: Infraspecific autonyms grab species authorship: IASA=True (default=False)
  • Output flag: infraspecific autonym matched against species: IAMS=True
  • Output flag: infraspecific autonym grabbed species authorship: IASA=True

Perhaps both options are a bit opinionated but I think many users could prefer this kind of match to a NoMatch.
And they both can save many unnecessary second run api requests (so much easier application design for final users).
This also means reducing server load in terms of number of requests, which could be good even if that increases response time to each request (but only when the options are used).
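
To make the first option concrete, here is a rough client-side sketch in Python. Everything in it is hypothetical: the rank markers, the `lookup` callable standing in for a datasource query, and the result layout are illustrative only and do not describe gnames behaviour. The flag name IAMS is taken from the suggestion above.

```python
# Rank markers to skip when comparing epithets (illustrative, not exhaustive).
RANK_MARKERS = {"subsp.", "ssp.", "var.", "f."}


def epithet_words(canonical):
    """Split a canonical name into words, dropping rank markers."""
    return [word for word in canonical.split() if word not in RANK_MARKERS]


def is_autonym(canonical):
    """True when the infraspecific epithet repeats the specific epithet."""
    words = epithet_words(canonical)
    return len(words) == 3 and words[1] == words[2]


def lookup_with_species_fallback(canonical, lookup):
    """Try the full name first, then the species if the name is an autonym.

    `lookup` is any callable returning a matched record or None.
    """
    hit = lookup(canonical)
    if hit is not None:
        return {"match": hit, "IAMS": False}
    if is_autonym(canonical):
        species = " ".join(epithet_words(canonical)[:2])
        hit = lookup(species)
        if hit is not None:
            return {"match": hit, "IAMS": True}
    return {"match": None, "IAMS": False}
```

For a datasource that only holds "Festuca rubra", looking up "Festuca rubra subsp. rubra" would return the species record with IAMS set to True.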

Thanks a lot in advance

No name should match with edit distance `-1`

Edit distance -1 is an internal flag. It means that the edit distance was higher than the allowed threshold. It should not appear in the results.

Currently it appears for "Echium flavo flore"

Some misspellings cannot be found

Reflects issue gnames/gnverifier#83

@abubelinha writes
input name: Isoetes longisima (bad-spelled: "s" instead of "ss")
resolver's output: 3 expected matches ("Fuzzy match by canonical form"): Isoetes longissima in all data sources
verifier's output: 6 unexpected matches (all "PartialExact"), all really odd to me (either genus rank, or an unrelated hybrid name ... but none of them matches with Isoetes longissima). Not even the bestResult is useful (PartialExact match against genus Isoetes)

improve logs

Move logs from logrus to zerolog and add more information to NSQ logs.

As a User I want to be able to get names data out of IPFS if I cannot get access to database.

Generating data from the database either needs access to it from one of our servers, or requires downloading a data dump and restoring it to a local database. Both are hard to do. If there is no access to the database, download an archived version of the bloom filters and trie from IPFS using a stable IPNS link that comes from the configuration file. Use the IPFS API to get the data.

As a Developer I want to refactor matching and config codes.

It would be better to make matching its own package, so that the top gnparser package has no matching logic and depends on the matching package. This would allow the code to be organized better, but would require creating a special Match structure to deal with dictionaries. We need a new Init function to initialize this Match structure.

Config needs to be moved to its own package too, so that all packages can use config data without going up to the top package.
