gnames / gnmatcher

gnmatcher provides fast stemming and fuzzy matching algorithms for matching scientific names

License: MIT License

Makefile 1.66% Go 97.40% Dockerfile 0.23% Shell 0.71%

gnmatcher's Introduction

GNames

The goal of the GNames project is to provide an accurate and fast verification of scientific names in unlimited quantities. The verification should be fast (at least 1000 names per second) and include exact and fuzzy matching of input strings to scientific names aggregated from a large number of data-sources.

If you do not need exact records of matched names from data-sources, and just want to know whether a name-string is known, you can use GNmatcher instead of this project. GNmatcher is significantly faster and has simpler output.

Features
Installation
- Installation prerequisites
- Installation process
Configuration
Usage as API
Usage with GNverifier
Web-Logs
Known limitations of the verification
Development
Authors
License

Features

Fast verification of an unlimited number of scientific names.
Multiple levels of verification:
- Exact matching (exact string match for viruses, exact canonical form match for Plantae, Fungi, Bacteria, and Animalia).
- Fuzzy matching detects human and/or Optical Character Recognition (OCR) errors without producing a large number of false positives. To avoid false positives, uninomial names are only checked for an exact match.
- PartialExact matching happens when a match for the full name-string is not found. In such cases middle or end words are removed and each variant is verified. Matches of names with the last word intact are preferred.
- PartialFuzzy matching is provided for partial matches of species and infraspecies. To avoid false positives, uninomials are only checked for an exact match.
- Virus matching provides verification of virus names.
- FacetedSearch allows searching with a flexible query language.
Providing names information from data-sources that contain a particular name.
- Returning the "best" result. The BestResult is calculated by a scoring algorithm.
- Optionally, limiting results to data-sources that are important to a GNames user.
Providing outlink URLs to the websites of some data-sources to show the original record of a name.
Providing meta-information about aggregated data-sources.

Installation

Most users do not need to install GNames and can use the remote GNames API service at http://verifier.globalnames.org/api/v1 or the command line client GNverifier. Nevertheless, it is possible to install a local copy of the service.

Installation prerequisites

A Linux-based operating system.
At least 32GB of memory.
At least 50GB of free disk space.
A fast Internet connection during installation. After installation GNames can operate without a remote connection.
PostgreSQL database.

Installation process

PostgreSQL

We do not cover the basics of PostgreSQL administration here. There are many tutorials and resources for Linux-based operating systems that can help.

Create a database named gnames. Download the gnames database dump. Restore the database with:
```
gunzip -c gnames_latest.tar.gz |pg_restore -d gnames
```

GNmatcher

Refer to the GNmatcher documentation for its installation.

GNames

Download the latest release of GNames, unpack it, and place the binary somewhere in your PATH.

Run gnames -V. It will show you the version of GNames and also generate the $HOME/.config/gnames.yaml configuration file.

Edit $HOME/.config/gnames.yaml according to your preferences.

Try it by running
```
gnames rest -p 8888
```
To start the service automatically, you can create a systemd unit for it, if your system supports systemctl.

Alternatively, you can use the Docker image to run GNames. You will need to create a file with the corresponding environment variables, which are described in the .env.example file.
```
docker pull gnames/gnames:latest
docker run --env-file path_to_env_file -d -i -t -p 8888:8888 \
  gnames/gnames:latest rest -p 8888
```
We provide an example of an environment file. Environment variables override configuration file settings.

Configuration

Configuration settings can either be given in the config file located at $HOME/.config/gnames.yaml, or by setting the following environment variables:

Env. Var.	Configuration
GN_CACHE_DIR	CacheDir
GN_JOBS_NUM	JobsNum
GN_MATCHER_URL	MatcherURL
GN_MAX_EDIT_DIST	MaxEditDist
GN_PG_DB	PgDB
GN_PG_HOST	PgHost
GN_PG_PASS	PgPass
GN_PG_PORT	PgPort
GN_PG_USER	PgUser
GN_PORT	Port

The meanings of the configuration settings are described in the default gnames.yaml.
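
To illustrate how environment variables can take precedence over file settings, here is a minimal Go sketch. It is not the GNames implementation: the `Config` struct covers only a few of the settings from the table, `applyEnv` and `lookupFunc` are hypothetical names, and every default value except port 8888 is a placeholder.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config holds a subset of the settings listed in the table above.
// It is a simplified stand-in, not the real GNames configuration type.
type Config struct {
	MatcherURL  string
	PgHost      string
	JobsNum     int
	MaxEditDist int
	Port        int
}

// lookupFunc has the same shape as os.LookupEnv, so tests can pass a fake.
type lookupFunc func(key string) (string, bool)

// applyEnv returns a copy of cfg in which every field that has a set and
// valid environment variable is replaced by that variable's value.
func applyEnv(cfg Config, lookup lookupFunc) Config {
	setStr := func(dst *string, key string) {
		if v, ok := lookup(key); ok && v != "" {
			*dst = v
		}
	}
	setInt := func(dst *int, key string) {
		v, ok := lookup(key)
		if !ok {
			return
		}
		// Values that are not integers are ignored and the file value is kept.
		if n, err := strconv.Atoi(v); err == nil {
			*dst = n
		}
	}
	setStr(&cfg.MatcherURL, "GN_MATCHER_URL")
	setStr(&cfg.PgHost, "GN_PG_HOST")
	setInt(&cfg.JobsNum, "GN_JOBS_NUM")
	setInt(&cfg.MaxEditDist, "GN_MAX_EDIT_DIST")
	setInt(&cfg.Port, "GN_PORT")
	return cfg
}

func main() {
	// Placeholder values standing in for what was read from gnames.yaml.
	fromFile := Config{PgHost: "localhost", JobsNum: 4, MaxEditDist: 1, Port: 8888}
	cfg := applyEnv(fromFile, os.LookupEnv)
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function as a parameter keeps the override logic independent of the process environment, which makes it simple to test.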

Usage as API

Please note that the API currently under development (documentation) is publicly served at https://verifier.globalnames.org/api/v1.

If you installed GNames locally and want to run its API, run:

gnames rest
# to change from default 8888 port
gnames rest -p 8787

Refer to the GNames RESTful API Documentation for details on interacting with the GNames API.

Usage with GNverifier

GNverifier is a command line client for the GNames backend. It uses the publicly available remote API of GNames. Install and use it according to the GNverifier documentation.

GNverifier also provides a web-based user interface to GNames. To launch it use something like:

gnverifier -p 8777

Known limitations of the verification

Exact matches of misspellings that might exist in poorly curated databases prevent fuzzy matches from better-curated sources from being found.

To increase performance we stop any further attempts once a name has matched successfully. This prevents fuzzy matching if a misspelled name is found somewhere. It is helpful to check the 'curation' field of the returned result and see how many data-sources contain the name.

Fuzzy matching of a name where the genus string is broken by a space.

For example, we cannot match 'Abro stola triplasia' to 'Abrostola triplasia'. The edit distance between the strings is only 1; however, we stem specific epithets, so in reality we fuzzy-match 'Abro stol triplas' to 'Abrostola triplas'. That gives an edit distance of 2, which is usually beyond our threshold.
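
The arithmetic above can be checked with a plain Levenshtein distance. The Go sketch below is only an illustration: the `levenshtein` and `min3` helpers are written for this example and are not the package GNames uses, and the stemmed strings are copied from the paragraph above rather than produced by a stemmer.

```go
package main

import "fmt"

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the number of single-character insertions, deletions
// and substitutions needed to turn a into b, comparing rune by rune.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	cur := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev // the finished row becomes the previous row
	}
	return prev[len(rb)]
}

func main() {
	// Original strings: only the extra space differs.
	fmt.Println(levenshtein("Abro stola triplasia", "Abrostola triplasia")) // 1
	// Stemmed strings: the extra space plus the missing 'a' after "stol".
	fmt.Println(levenshtein("Abro stol triplas", "Abrostola triplas")) // 2
}
```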

Development

Install the Go language for your Linux operating system.
Create a PostgreSQL database as described in Installation.
Clone the GNames code.
Clone GNmatcher and set it up for development.
Install docker and docker compose.
Go to your local gnames directory:
- Run make dc
- Run docker-compose up
- In another terminal window run go test ./...

Authors

Dmitry Mozzherin

License

The GNames code is released under the MIT license.

gnmatcher's Issues

Add exact string match for viruses

We do not parse virus names, so the best possible solution at the moment is to match virus name strings exactly. The term "virus" is not used here in a strict scientific sense: besides viruses we include phages, prions, plasmids, etc.

"Acacia horrida nur" does not match "Acacia horria" as PartialExact Match.

"Abro stola triplasia" should fuzzy-match "Abrostola triplasia Linnaeus"

As a User I want matching to take into account data-sources that are being matched.

Stems that are the same as canonical form are missing

For example "Drosophila melanogaster" cannot be fuzzy searched because it is not in the gnames database as a stem.

related issue: gnames/gnidump#8

can I use gnmatcher to match two local lists?

I am just wondering if gnmatcher is the right tool to use (offline) for this purpose:

I have two lists of scientific names, and I want to produce a 3rd list of matches.

List1 is just "correct" (i.e., an already curated national checklist with 50000 names).
List2 contains about 4000 names expected to be in List1, but they could contain misspellings.
I just want to compare names in List2 against those in List1, and receive this back:
List3 of 4000 items (either correct names as they matched from List1, or "couldn't match" flags).
I suppose the program has to be fed with List1, List2 and also with a numeric parameter (minimum percentage of similarity required for a positive match, or something like that).

Example:

master list (L1):
Homo sapiens L.
Felis catus L.
Canis familiaris L.
Canis lupus L.
Sus domesticus Erxleben
Sus scrofa L.
Ovis aries L.
Bos taurus L.
Panthera leo
Gallus gallus L.
Equus asinus L.
Apis mellifera L.
Anas platyrhynchos L.

names to match (L2)	==> matcher(L1, L2, min=0.9) ==>	matches (L3)
Bos taurus Linn.	==>	Bos taurus L.
Gallus asinus	==>	couldn't match
Momo sapiens	==>	Homo sapiens L.
Sus domesticus L.	==>	couldn't match

I would like to use Python system calls (Windows OS) to pass the lists and other parameters and receive back results.

Is gnmatcher (or any other gnames software) a good choice for doing this?
If not, could you recommend other options?

The main thing here is that I want to reconcile a list of (possibly misspelled) names against my own-curated list (not against any web services).

I suppose this is not difficult to implement in pure Python using some kind of fuzzy string matching library.
But I am already calling other gnames tools from Python (i.e. gnparser is returning results really fast).
So I thought it could be possible to find a tool which I can call for doing this task the same way.

As a User I want Jobs configuration parameter

As a User I want to detect false positives from bloom filters matching

do not look for full canonical

I think we do want to show matches of a full canonical that either do not have rank, or have a wrong rank. They should never appear in best results, but they do have to appear in the rest of results.

As a Developer I need a tool for measuring performance of the matching algorithms.

As a User I want to see if middleware compression makes service faster

Enable fuzzy matching

As a User I want to see partial matching

If we cannot match full name, we try to match parts of it up to genus

bloom filters start to miss matches if accessed concurrently

When bloom filters are used in parallel, their checks do not work all the time.

As a User I need documentation in Readme, and as a Developer, in the code.

"Bubo bubo" returns as Partial match, instead of Exact match

option to match by species epithet in case of NoMatch for infraspecific autonyms (also in gndiff)

This is not a bug but a couple of feature requests. I guess it may be common need for other people.
I am finally beginning to use gnames for resolving long lists of names, and I have a problem with nominotypical infraspecies (a little more here).
Consider these two plant names:

Species: Narcissus minor L.
Typical subspecies: Narcissus minor L. subsp. minor

As you can see, there are only a couple of datasources matches for the last one (BTW see gnames/gnames#92), whereas 15 matches for the first one.
This is because most datasources do not currently accept infraspecific taxa subordinated to the species. So they resolve them as synonyms of some other species.
That's fine, but in many cases the datasource authors didn't consider it necessary to include one more record entry to also resolve the typical infraspecies so that it points to its parent species.
So, we get no matches when we look for these "typical infraspecies" in many trustable datasources, even when the species name exists in most of them. Of course a human brain would immediately decide to take the parent specific rank name as a good match. But gnames of course does not (as expected).

That's a first problem (for my use case) because users are forced to make a second run of all these names against gnames after discarding their infraspecific epithets, in order to get a match in preferred datasources.
We can't do this in the first run (discarding all autonyms in our original list), because then we would miss all those cases where infraspecific autonyms are actually present in data sources.

A second problem is that some datasources that do show those autonym names do not provide the author's name (this is correct), whereas some others do provide the species' author (which IMHO is a good decision). That causes some inconsistencies to take into account when producing a checklist, because you would like all names displayed the same way (in my case, with an authorship provided by the preferred datasource, even if this is provided at the species level).

So, I would request this both for gnames api and also for future gndiff lists matching:

OPTION: if an infraspecific autonym returns NoMatch, then go for just matching against the specific name (datasource species epithet = requested species epithet = requested infraspecies epithet) and return that instead. Please add an informative flag in the output for the cases where this happened.
OPTION: If an infraspecific autonym has no authorship, then return the species author (i.e. the author which would be returned by option 1 above). Also flag when this happened.
And make it happen whether the match is "real" (infraspecific autonym present in datasource, but without an authorship) or "forced" by the use of option 1. So no name is returned without authorship if possible.

As for how to name these options and the output flags, I have no preferences. Just in case you need ideas:

Request option: Infraspecific autonyms match species: IAMS=True (default=False)
Request option: Infraspecific autonyms grab species authorship: IASA=True (default=False)
Output flag: infraspecific autonym matched against species: IAMS=True
Output flag: infraspecific autonym grabbed species authorship: IASA=True

Perhaps both options are a bit opinionated but I think many users could prefer this kind of match to a NoMatch.
And they both can save many unnecessary second run api requests (so much easier application design for final users).
This also means reducing server load in terms of number of requests, which could be good even if that increases response time to each request (but only when the options are used).

Thanks a lot in advance

As a User I need to have as good performance as possible, especially for fuzzy matching.

Add options for NSQ logging filters

Refactor entities, move them to gnlib to avoid circular dependencies

As a Developer I want GoDoc look clear and complete

Improve documentation, while checking it at GoDoc, making sure that documentation is clear and well organized.

Prepare gnmatcher to first release (v0.3.0)

try suffix tree algorithm for virus matching

As a User I want very fast matching by canonical forms

We will use Bloom filters to make very fast matching by canonical form.
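
For readers unfamiliar with the technique, a Bloom filter answers "is this string possibly in the set?" without storing the strings themselves. The Go sketch below is a generic, minimal version and makes no claim about how gnmatcher builds or sizes its filters: the `bloom` type, its parameters, and the FNV-based double hashing are all assumptions chosen for brevity.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a fixed-size bit set probed k times per item.
type bloom struct {
	bits []uint64
	m    uint64 // total number of bits
	k    uint64 // number of probes per item
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes derives two base hashes; probe i uses h1 + i*h2 (double hashing).
func hashes(s string) (uint64, uint64) {
	a := fnv.New64a()
	a.Write([]byte(s))
	b := fnv.New64()
	b.Write([]byte(s))
	return a.Sum64(), b.Sum64() | 1 // force h2 to be odd so it is never zero
}

// add sets the k bits that correspond to s.
func (f *bloom) add(s string) {
	h1, h2 := hashes(s)
	for i := uint64(0); i < f.k; i++ {
		pos := (h1 + i*h2) % f.m
		f.bits[pos/64] |= uint64(1) << (pos % 64)
	}
}

// mayContain reports false if s was definitely never added, and true if it
// was added or if its bits happen to collide with other items.
func (f *bloom) mayContain(s string) bool {
	h1, h2 := hashes(s)
	for i := uint64(0); i < f.k; i++ {
		pos := (h1 + i*h2) % f.m
		if f.bits[pos/64]&(uint64(1)<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 3)
	f.add("Bubo bubo")
	f.add("Homo sapiens")
	fmt.Println(f.mayContain("Bubo bubo"))   // an added name is always reported
	fmt.Println(f.mayContain("Felis catus")) // false unless a collision occurs
}
```

A filter never misses an item that was added, but it can report false positives, so a positive answer still needs confirmation elsewhere. This sketch is also not safe for concurrent writes without additional synchronization.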

As a User I want to have a convenience GET method for name matching

As a Developer I need to migrate dependencies to #21 gnames architecture.

Add optional NSQ logger

As a Developer I prefer gnmatcher to return []Match instead of []*Match

There is no real need for the final output to be a slice of pointers. Removing them cleans up the interface and makes it easier to translate to native or remote implementations.

`Teucrium pyrenaicum subsp. guarense` should return fuzzy matches

Also gnames/gnames#90

As a User I want to get all exact matches by stemmed canonical

Currently, when an exact match is performed, results that match the stemmed canonical form but have a different suffix than the input are discarded.
It is useful for names with a different suffix to be part of the match.

This issue is related to conversation at gnames/gnverifier#80

No name should match with edit distance `-1`

Edit distance -1 is an internal flag. It means that the edit distance was higher than the allowed threshold. It should not appear in the results.

Currently it appears for "Echium flavo flore"
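
One way to keep such a sentinel from leaking is to filter results just before they are returned. The Go sketch below is hypothetical: the `match` struct, the `overThreshold` constant, and the candidate names are invented for illustration and do not mirror gnmatcher's actual types or data.

```go
package main

import "fmt"

// overThreshold marks a candidate whose edit distance exceeded the limit.
const overThreshold = -1

// match is a simplified, hypothetical result record.
type match struct {
	Name     string
	EditDist int
}

// dropOverThreshold returns only the matches with a real edit distance.
func dropOverThreshold(ms []match) []match {
	out := make([]match, 0, len(ms))
	for _, m := range ms {
		if m.EditDist != overThreshold {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	// Illustrative candidates, not real matcher output.
	candidates := []match{
		{Name: "candidate A", EditDist: 1},
		{Name: "candidate B", EditDist: overThreshold},
		{Name: "candidate C", EditDist: 0},
	}
	for _, m := range dropOverThreshold(candidates) {
		fmt.Println(m.Name, m.EditDist)
	}
}
```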

Add openapi description

As a User I want to be able to access gnmatcher via RESTful service

It is hard to parallelize gRPC services because they use permanent HTTP/2 connections. We will create a protobuf-based HTTP service and see if it brings faster name-resolution/reconciliation.

Some misspellings cannot be found

Reflects issue gnames/gnverifier#83

@abubelinha writes
input name: Isoetes longisima (bad-spelled: "s" instead of "ss")
resolver's output: 3 expected matches ("Fuzzy match by canonical form"): Isoetes longissima in all data sources
verifier's output: 6 unexpected matches (all "PartialExact"), all really odd to me (either genus rank, or an unrelated hybrid name ... but none of them matches with Isoetes longissima). Not even the bestResult is useful (PartialExact match against genus Isoetes)

As a Developer I want to switch to GNparser v1.x

Operate gnmatcher via gRPC

I do have some doubts about NATS, so I will try gRPC instead

Fast exact matching via bloom filters

If all fuzzy matches got filtered out because of too high edit distance, go to the next matching option instead of giving empty result.

As a User I want to use gnames matchtype to simplify the system

gnames MatchType covers everything except 2 things -- match by full canonical form and match of viruses. Add 2 booleans and migrate to MatchType from gnames.

As a User I want to verify 'Candidatus' bacterial names

Set plumbing with NATS messaging system

Uninomials do not match

Names like "Bubo", "Pardosa", "Drosophila" do not match

improve logs

Move logs from logrus to zerolog, add more information to NSQ logs

As a User I want to be able to get names data out of IPFS if I cannot get access to database.

Generating data from the database either needs access to it from one of our servers, or requires downloading a data dump and restoring it to a local database. Both are hard to do. If there is no access to the database, download an archived version of the bloom filters and trie from IPFS using a stable IPNS link that comes from the configuration file. Use the IPFS API to get the data.

Filter out matches that have more than 1 ED per 6 characters in a word

"Acacia horrida nur" matches "Acacia dura", which should not happen. The stems "nur" and "dur" are too short to count as a match.
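
A per-word limit of this kind could look like the Go sketch below. It is one possible reading of the rule, not the gnmatcher code: the helper names are hypothetical, the allowance is assumed to be the length of the shorter word divided by 6 (rounded down), and the inputs are compared as given, without stemming.

```go
package main

import (
	"fmt"
	"strings"
)

// editDistance is a plain Levenshtein distance over runes.
func editDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	cur := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			best := prev[j-1] // substitution, free if the runes are equal
			if ra[i-1] != rb[j-1] {
				best++
			}
			if d := prev[j] + 1; d < best { // deletion
				best = d
			}
			if d := cur[j-1] + 1; d < best { // insertion
				best = d
			}
			cur[j] = best
		}
		prev, cur = cur, prev
	}
	return prev[len(rb)]
}

// withinWordLimits reports whether every word pair stays within one edit
// per 6 characters of the shorter word. Word counts must be equal.
func withinWordLimits(input, candidate string) bool {
	in, cand := strings.Fields(input), strings.Fields(candidate)
	if len(in) != len(cand) {
		return false
	}
	for i := range in {
		shorter := len([]rune(in[i]))
		if n := len([]rune(cand[i])); n < shorter {
			shorter = n
		}
		if editDistance(in[i], cand[i]) > shorter/6 {
			return false
		}
	}
	return true
}

func main() {
	// "nur" vs "dur": 1 edit in a 3-letter word, allowance is 0.
	fmt.Println(withinWordLimits("Acacia nur", "Acacia dur")) // false
	// "longisima" vs "longissima": 1 edit, allowance is 9/6 = 1.
	fmt.Println(withinWordLimits("Isoetes longisima", "Isoetes longissima")) // true
}
```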

no matching for viruses, it will be done in a database

Switch to gnames/levenshtein package

As a Developer I want to refactor matching and config codes.

It would be better to make matching its own package, so that the top gnparser package has no matching logic and depends on the matching package. This would allow the code to be organized better, but would require creating a special Match structure to deal with dictionaries. We need a new Init function to initialize this Match structure.

Config needs to be moved to its own package too, so all the packages can use config data without going up to the top package.