GNames

The goal of the GNames project is to provide accurate and fast verification of scientific names in unlimited quantities. Verification should run at a rate of at least 1000 names per second and include exact and fuzzy matching of input strings to scientific names aggregated from a large number of data-sources.

If you do not need the exact records of matched names from data-sources and just want to know whether a name-string is known, you can use GNmatcher instead of this project. GNmatcher is significantly faster and has simpler output.

Features

  • Fast verification of an unlimited number of scientific names.
  • Multiple levels of verification:
    • Exact matching (exact string match for viruses, exact canonical form match for Plantae, Fungi, Bacteria, and Animalia).
    • Fuzzy matching detects human and/or Optical Character Recognition (OCR) errors without producing a large number of false positives. To avoid false positives, uninomial names are only checked for an exact match.
    • PartialExact matching happens when a match for the full name-string is not found. In such cases middle or end words are removed and each variant is verified. Matches of names with the last word intact are preferred.
    • PartialFuzzy matching is provided for partial matches of species and infraspecies. To avoid false positives, uninomials are only checked for an exact match.
    • Virus matching verifies names of viruses.
    • FacetedSearch allows searching with a flexible query language.
  • Providing name information from the data-sources that contain a particular name.
    • Returning the "best" result. The BestResult is calculated by a scoring algorithm.
    • Optionally, limiting results to data-sources that are important to a GNames user.
  • Providing outlink URLs to some data-source websites to show the original record of a name.
  • Providing meta-information about aggregated data-sources.
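
The PartialExact step described above can be illustrated with a small sketch. It is not the GNames implementation (GNames is written in Go, and its actual rules are not reproduced here); the function name `partial_variants` and the ordering rule are assumptions made for illustration. The sketch drops one contiguous run of middle or end words, never the first word, and lists variants that keep the last word before the others.

```python
def partial_variants(name: str) -> list[str]:
    """Return shortened candidates for a name-string, most preferred first.

    Hypothetical illustration: one contiguous run of middle or end words
    is removed, and the first word is always kept.
    """
    words = name.split()
    n = len(words)
    variants = []
    for start in range(1, n):               # index 0 (first word) is never dropped
        for end in range(start + 1, n + 1):
            variants.append(words[:start] + words[end:])
    # Variants that keep the original last word come first, then longer ones.
    variants.sort(key=lambda v: (v[-1] != words[-1], -len(v)))
    # Join and drop duplicates while preserving the order.
    return list(dict.fromkeys(" ".join(v) for v in variants))


print(partial_variants("Aus bus cus"))
# ['Aus cus', 'Aus bus', 'Aus']
```

Each candidate in the returned list would then be verified in turn, stopping at the first one that matches.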

Installation

Most users do not need to install GNames and can use the remote GNames API service at http://verifier.globalnames.org/api/v1 or the command line client GNverifier. Nevertheless, it is possible to install a local copy of the service.

Installation prerequisites

  • A Linux-based operating system.
  • At least 32GB of memory.
  • At least 50GB of free disk space.
  • A fast Internet connection during installation. After installation GNames can operate without a remote connection.
  • PostgreSQL database.

Installation process

  1. PostgreSQL

    We do not cover the basics of PostgreSQL administration here. There are many tutorials and resources for Linux-based operating systems that can help.

    Create a database named gnames. Download the gnames database dump. Restore the database with:

    gunzip -c gnames_latest.tar.gz | pg_restore -d gnames

  2. GNmatcher

    Refer to the GNmatcher documentation for its installation.

  3. GNames

    Download the latest release of GNames, unpack it, and place the binary somewhere in your PATH.

    Run gnames -V. It shows the version of GNames and also generates the $HOME/.config/gnames.yaml configuration file.

    Edit $HOME/.config/gnames.yaml according to your preferences.

    Try it by running:

    gnames rest -p 8888

    To start the service automatically, you can create a systemctl configuration for it, if your system supports systemctl.

    Alternatively, you can use a Docker image to run GNames. You will need to create a file with the corresponding environment variables, which are described in the .env.example file.

    docker pull gnames/gnames:latest
    docker run --env-file path_to_env_file -d -i -t -p 8888:8888 \
      gnames/gnames:latest rest -p 8888

    We provide an example environment file. Environment variables override configuration file settings.

Configuration

Configuration settings can be given either in the config file located at $HOME/.config/gnames.yaml or by setting the following environment variables:

Env. Var.           Configuration
GN_CACHE_DIR        CacheDir
GN_JOBS_NUM         JobsNum
GN_MATCHER_URL      MatcherURL
GN_MAX_EDIT_DIST    MaxEditDist
GN_PG_DB            PgDB
GN_PG_HOST          PgHost
GN_PG_PASS          PgPass
GN_PG_PORT          PgPort
GN_PG_USER          PgUser
GN_PORT             Port

The meanings of the configuration settings are described in the default gnames.yaml.
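
The override rule (environment variables take precedence over the config file) can be sketched as follows. This is only an illustration, not GNames code: the mapping is copied from the table above, while the function name `effective_config` and the sample values are hypothetical. Type conversion is not modelled, so values taken from the environment stay strings.

```python
import os
from typing import Mapping

# Environment variable -> configuration setting, as listed in the table above.
ENV_TO_SETTING = {
    "GN_CACHE_DIR": "CacheDir",
    "GN_JOBS_NUM": "JobsNum",
    "GN_MATCHER_URL": "MatcherURL",
    "GN_MAX_EDIT_DIST": "MaxEditDist",
    "GN_PG_DB": "PgDB",
    "GN_PG_HOST": "PgHost",
    "GN_PG_PASS": "PgPass",
    "GN_PG_PORT": "PgPort",
    "GN_PG_USER": "PgUser",
    "GN_PORT": "Port",
}


def effective_config(file_settings: Mapping, environ: Mapping = os.environ) -> dict:
    """Merge settings from the config file with environment overrides."""
    merged = dict(file_settings)
    for env_name, setting in ENV_TO_SETTING.items():
        if env_name in environ:
            # An environment variable wins over the value from the file.
            merged[setting] = environ[env_name]
    return merged


# Hypothetical values, for illustration only.
print(effective_config({"Port": 8888, "PgHost": "localhost"}, {"GN_PORT": "8787"}))
# {'Port': '8787', 'PgHost': 'localhost'}
```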

Usage as API

Please note that the API currently under development (documentation) is publicly served at https://verifier.globalnames.org/api/v1.

If you installed GNames locally and want to run its API, run:

gnames rest
# to change from default 8888 port
gnames rest -p 8787

Refer to the GNames RESTful API Documentation for details on interacting with the GNames API.
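
As a rough sketch of what a client call might look like, the snippet below builds (but does not send) a verification request. The base URL comes from this README; the `/verifications` path and the `nameStrings` field are assumptions that should be checked against the API documentation before use, and the function name is hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://verifier.globalnames.org/api/v1"


def build_verification_request(names, base_url=BASE_URL):
    """Prepare a POST request carrying name-strings as JSON.

    Assumption: the endpoint path and the payload field name below are
    illustrative and not confirmed by this README.
    """
    payload = {"nameStrings": list(names)}
    return urllib.request.Request(
        base_url.rstrip("/") + "/verifications",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_verification_request(["Abrostola triplasia"])
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req) and decode the JSON response.
```

For a local installation the same function could be pointed at the local service by passing a different `base_url`.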

Usage with GNverifier

GNverifier is a command line client for the GNames backend. It uses the publicly available remote API of GNames. Install and use it according to the GNverifier documentation.

GNverifier also provides a web-based user interface to GNames. To launch it, use something like:

gnverifier -p 8777

Known limitations of the verification

  • Exact matches of misspellings that exist in poorly curated databases prevent finding fuzzy matches from better curated sources.

    To increase performance, we stop any further attempts once a name is matched successfully. This prevents fuzzy matching if a misspelled name is found somewhere. It is helpful to check the 'curation' field of the returned result and to see how many data-sources contain the name.

  • Fuzzy matching of a name where the genus string is broken by a space.

    For example, we cannot match 'Abro stola triplasia' to 'Abrostola triplasia'. The edit distance between the two strings is only 1; however, we stem specific epithets, so in reality we fuzzy-match 'Abro stol triplas' to 'Abrostola triplas'. The edit distance then becomes 2, which is usually beyond our threshold.
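
The arithmetic in the second limitation can be checked with a plain Levenshtein distance. This is a generic textbook implementation used only to verify the numbers; it is an assumption that it behaves like the distance GNames uses for these two pairs, and the stemmed strings are taken verbatim from the example above rather than produced by a stemmer.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# Unstemmed strings: only the extra space differs.
print(edit_distance("Abro stola triplasia", "Abrostola triplasia"))  # 1
# Stemmed strings: the space plus the 'a' lost from 'stola'.
print(edit_distance("Abro stol triplas", "Abrostola triplas"))       # 2
```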

Development

  • Install Go language for your Linux operating system.
  • Create PostgreSQL database as described in installation.
  • Clone the GNames code.
  • Clone the GNmatcher code and set it up for development.
  • Install Docker and Docker Compose.
  • Go to your local gnames directory:
    • Run make dc
    • Run docker-compose up
    • In another terminal window run go test ./...

Authors

License

The GNames code is released under the MIT license.
