gnames / gnparser Goto Github PK


GNparser normalises scientific names and extracts their semantic elements.

License: MIT License

Dockerfile 0.05% Makefile 0.91% Go 86.79% CSS 9.73% HTML 1.55% Ragel 0.64% Nix 0.33%

parser biodiversity bioinformatics nomenclature scientific-names

gnparser's Introduction

Global Names Parser: GNparser written in Go

Try GNparser online.

Try GNparser with OpenRefine

GNparser splits scientific names into their semantic elements with associated meta information. Parsing is indispensable for matching names from different data sources, because it can normalize different lexical variants of names to the same canonical form.

This parser, written in Go, is the third iteration of the project. The first, biodiversity, was written in Ruby; the second, also called gnparser, was written in Scala. This project now replaces the other two: the Scala project is archived, and biodiversity now uses the Go code for parsing. All three projects were developed as part of the Global Names Architecture Project.

To use GNparser as a command line tool on Windows, Mac or Linux, download the latest release, uncompress it, and copy the gnparser binary somewhere in your PATH. On a Mac you might also need to open System Preferences and, in the security panel, select Allow from other developers. Then, after running gnparser, click 'Yes' in the dialog box that asks whether to run a program from an "unregistered developer".

tar xvf gnparser-v1.0.0-linux.tar.gz
sudo cp gnparser /usr/local/bin
# for CSV output
gnparser "Homo sapiens Linnaeus"
# for TSV output
gnparser -f tsv "Homo sapiens Linnaeus"
# for JSON output
gnparser -f compact "Homo sapiens Linnaeus"
gnparser -f compact "Homo sapiens Linnaeus" | jq
# or
gnparser -f pretty "Homo sapiens Linnaeus"
gnparser -h

Citing
Introduction
Speed
Features
Use Cases
Tutorials
Installation
Usage
Parsing ambiguities
- Names with filius (ICN code)
- Names with subgenus (ICZN code) and genus author (ICN code)
Authors
Contributors
References
License

Citing

Zenodo DOI can be used to cite GNparser

Introduction

Global Names Parser or GNparser is a program written in Go for breaking up scientific names into their elements. It uses peg -- a Parsing Expression Grammar (PEG) tool.

Many other parsing algorithms for scientific names use regular expressions. This approach works well for extracting canonical forms in simple cases. However, for complex scientific names and to parse scientific names into all semantic elements, regular expressions often fail, unable to overcome the recursive nature of data embedded in names. By contrast, GNparser is able to deal with the most complex scientific name-strings.

GNparser takes a name-string like Drosophila (Sophophora) melanogaster Meigen, 1830 and returns parsed components in CSV, TSV or JSON format. Parsing scientific names can become surprisingly complex; GNparser's test file is a good source of information about the parser's capabilities, its input and its output.

GNparser reached a stable v1. Differences between v1 and v0

Speed

Number of names parsed per second on an AMD Ryzen 7 5800H CPU (8 cores, 16 threads), GNparser v1.3.0:

gnparser 1_000_000_names.txt -j 200 > /dev/null

Threads	names/sec
1	9,000
2	19,000
4	35,000
8	56,000
16	82,000
100	107,000
200	111,000

For the simplest output, Go GNparser is roughly 2 times faster than Scala GNparser and about 100 times faster than the pure Ruby implementation. For JSON formats the parser is approximately 8 times faster than the Scala one, due to more efficient JSON conversion.

Features

Fastest parser ever.
Very easy to install, just placing executable somewhere in the PATH is sufficient.
Extracts all elements from a name, not only canonical forms.
Works with very complex scientific names, including hybrid formulas.
Includes RESTful service and interactive web interface.
Can run as a command line application.
Can be used as a library in Go projects.
Can be scaled to many CPUs and computers (if 250 million names an hour is not enough).
Calculates a stable UUID version 5 ID from the content of a string.
Provides a C binding to incorporate the parser into other languages.

Use Cases

Getting the simplest possible canonical form

Canonical forms of a scientific name are the latinized components without annotations, authors or dates. They are great for matching lexical variants of names. Three versions of canonical forms are included:

Canonical	Example	Use
-	Spiraea alba var. alba Du Roi	Best for disambiguation, but has many lexical variants
Full	Spiraea alba var. alba	Presentation, infraspecies disambiguation
Simple	Spiraea alba alba	Name matching, presentation
Stem	Spiraea alb alb	Best for matching fem./masc. inconsistencies

The canonicalName -> full is good for presentation, as it keeps more details.

The canonicalName -> simple field is good for matching names from different sources, because sometimes dataset curators omit hybrid sign in named hybrids, or remove ranks for infraspecific epithets.

The canonicalName -> stem field normalizes the simple canonical form even further. It makes it possible to match names with inconsistent gender suffixes in specific epithets (for example alba vs. albus). The normalization follows the stemming rules for the Latin language described in Schinke R et al (1996). For example, letters j are converted to i, letters v are converted to u, and suffixes are removed from the specific and infraspecific epithets.

If you mostly care about the canonical form of a name, you can use the default --format csv flag with the command line tool.

CSV/TSV output has the following fields:

Field	Meaning
Id	UUID v5 generated out of Verbatim
Verbatim	Input name-string without any changes
Cardinality	0 - N/A, 1 - Uninomial, 2 - Binomial etc.
CanonicalStem	Simplest canonical form with removed suffixes
CanonicalSimple	Simplest canonical form
CanonicalFull	Canonical form with hybrid sign and ranks
Authors	Authorship of a name
Year	Year of the name (if given)
Quality	Parsing quality
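
As a minimal sketch of consuming this output from Go, the snippet below reads CSV text with the standard encoding/csv package and indexes every row by the column names in the header row. The header and the sample row are copied from the Go library example further down in this README, where the authorship column is named Authorship. The helper name readRows is hypothetical.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// sample is the header and one data row as printed by the Go library
// example in this README.
const sample = `Id,Verbatim,Cardinality,CanonicalStem,CanonicalSimple,CanonicalFull,Authorship,Year,Quality
e2fdf10b-6a36-5cc7-b6ca-be4d3b34b21f,"Pardosa moesta Banks, 1892",2,Pardosa moest,Pardosa moesta,Pardosa moesta,Banks 1892,1892,1
`

// readRows (hypothetical helper) parses CSV text and returns one map per
// data row, keyed by the column names taken from the header row.
func readRows(text string) ([]map[string]string, error) {
	records, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	header := records[0]
	rows := make([]map[string]string, 0, len(records)-1)
	for _, rec := range records[1:] {
		row := make(map[string]string, len(header))
		for i, col := range header {
			if i < len(rec) {
				row[col] = rec[i]
			}
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	rows, err := readRows(sample)
	if err != nil {
		fmt.Println("cannot read CSV:", err)
		return
	}
	for _, row := range rows {
		fmt.Println(row["CanonicalSimple"], "|", row["Year"], "|", row["Quality"])
	}
}
```

Looking fields up by header name rather than by position keeps the code working if columns are added or reordered in a later release.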

Quickly partition names by the type

Usually scientific names can be broken into groups according to the number of elements:

Uninomial
Binomial
Trinomial
Quadrinomial

The output of GNparser contains a Cardinality field that tells, when possible, how many elements are detected in the name.

Cardinality	Name Type
0	Undetermined
1	Uninomial
2	Binomial
3	Trinomial
4	Quadrinomial

For hybrid formulas, "approximate" names (with "sp.", "spp." etc.), unparsed names, as well as names from BOLD project cardinality is 0 (Undetermined)
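
The table above can be turned into a small partitioning routine. The following is only a sketch: the record type and the function names are hypothetical, the cardinality values in main are assigned by hand for illustration, and any value outside 1 to 4 is treated as Undetermined, which is an assumption of this example.

```go
package main

import "fmt"

// record is a hypothetical holder for two fields of a parsed name.
type record struct {
	Verbatim    string
	Cardinality int
}

// nameType maps a Cardinality value to the name type from the table.
// Values outside 1..4 are reported as Undetermined (an assumption).
func nameType(cardinality int) string {
	switch cardinality {
	case 1:
		return "Uninomial"
	case 2:
		return "Binomial"
	case 3:
		return "Trinomial"
	case 4:
		return "Quadrinomial"
	default:
		return "Undetermined"
	}
}

// partition groups verbatim name-strings by their name type.
func partition(recs []record) map[string][]string {
	groups := make(map[string][]string)
	for _, r := range recs {
		t := nameType(r.Cardinality)
		groups[t] = append(groups[t], r.Verbatim)
	}
	return groups
}

func main() {
	// Cardinality values below are written by hand for illustration.
	recs := []record{
		{"Pardosa moesta Banks, 1892", 2},
		{"Bubo bubo", 2},
		{"Spiraea alba var. alba Du Roi", 3},
		{"Aus sp.", 0},
	}
	for t, names := range partition(recs) {
		fmt.Println(t, len(names))
	}
}
```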

Normalizing name-strings

There are many inconsistencies in how scientific names may be written. Use normalized field to bring them all to a common form (spelling, spacing, ranks).

Removing authorship from the middle of the name

Often data administrators split name-strings into a "name part" and an "authorship part". This practice loses some information when dealing with names like "Prosthechea cochleata (L.) W.E.Higgins var. grandiflora (Mutel) Christenson". However, if this is the use case, a combination of canonicalName -> full with the authorship from the lowest taxon will do the job. You can also use the default --format csv flag of the gnparser command line tool.

Figuring out if names are well-formed

If there are problems with parsing a name, parser generates qualityWarnings messages and lowers parsing quality of the name. Quality values mean the following:

"quality": 1 - No problems were detected.
"quality": 2 - There were small problems, normalized result should still be good.
"quality": 3 - There are some significant problems with parsing.
"quality": 4 - There were serious problems with the name, and the final result is rather doubtful.
"quality": 0 - A string could not be recognized as a scientific name and parsing failed.
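
One practical detail of this scale is that 0 is numerically the smallest value but describes the worst outcome, so a plain numeric comparison puts failed names ahead of clean ones. The sketch below shows one way to handle that; the function names and the review threshold are hypothetical choices for this example, not part of GNparser.

```go
package main

import "fmt"

// severity orders quality values from best (1) to worst. Quality 0 means
// that parsing failed, so it is ranked after quality 4.
func severity(quality int) int {
	if quality == 0 {
		return 5
	}
	return quality
}

// needsReview reports whether a name should be checked by a person,
// given the lowest severity that triggers a review.
func needsReview(quality, threshold int) bool {
	return severity(quality) >= threshold
}

func main() {
	for _, q := range []int{1, 2, 3, 4, 0} {
		fmt.Printf("quality %d: review=%v\n", q, needsReview(q, 3))
	}
}
```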

Creating stable GUIDs for name-strings

GNparser uses UUID version 5 to generate its id field. There is an algorithmic 1:1 relationship between the name-string and the UUID. Moreover, the same algorithm can be implemented in any popular language to generate the same UUID. Such IDs can be used to globally connect information about name-strings or information associated with name-strings.

More information about UUID version 5 can be found in the Global Names blog
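
To illustrate why such an ID is reproducible in any language, here is a minimal sketch of UUID version 5 generation (RFC 4122) using only the Go standard library. The namespace used for name-strings in main, a version 5 UUID derived from the DNS name globalnames.org, is an assumption of this sketch; confirm it against the Global Names blog post or against an id produced by gnparser before relying on it. The function names are hypothetical.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// dnsNamespace is the namespace UUID for DNS names defined in RFC 4122:
// 6ba7b810-9dad-11d1-80b4-00c04fd430c8.
var dnsNamespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// uuidV5 hashes the namespace followed by the name with SHA-1, keeps the
// first 16 bytes, and sets the version (5) and RFC 4122 variant bits.
func uuidV5(namespace [16]byte, name string) [16]byte {
	h := sha1.New()
	h.Write(namespace[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return u
}

// formatUUID renders the 16 bytes in the usual 8-4-4-4-12 hex layout.
func formatUUID(u [16]byte) string {
	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

func main() {
	// Assumed namespace for name-strings: UUID v5 of "globalnames.org"
	// in the DNS namespace.
	gnNamespace := uuidV5(dnsNamespace, "globalnames.org")
	fmt.Println(formatUUID(uuidV5(gnNamespace, "Pardosa moesta Banks, 1892")))
}
```

Because the result depends only on the namespace and the bytes of the name-string, two systems that never exchange data still arrive at the same identifier for the same string.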

Assembling canonical forms etc. from original spelling

GNparser tries to correct problems with spelling, but sometimes it is important to keep the original spelling of the canonical forms or authorship. The words field attaches semantic meaning to every word in the original name-string and allows users to create canonical forms or other combinations using the original verbatim spelling of the words. Each element in words contains 4 parts:

verbatim value of a word
semantic meaning of the word
start position of the word
end position of the word

The words section belongs to additional details. To use it enable --details flag for the command line application.

gnparser -d "Pardosa moesta Banks, 1892"
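
The following sketch shows how the positions can be used to rebuild a string from the original spelling. It is an illustration under stated assumptions: the word struct, its field names and the meaning labels are hypothetical, the sample words are written by hand rather than taken from real GNparser output, and positions are assumed to be character offsets with an exclusive end.

```go
package main

import (
	"fmt"
	"strings"
)

// word mirrors the parts listed above. Field names and meaning labels
// are assumptions of this sketch, not GNparser's actual schema.
type word struct {
	Verbatim string
	Meaning  string
	Start    int
	End      int
}

// sampleName and sampleWords are hand-written illustration data.
const sampleName = "Pardosa moesta Banks, 1892"

var sampleWords = []word{
	{"Pardosa", "genus", 0, 7},
	{"moesta", "species", 8, 14},
	{"Banks", "author", 15, 20},
	{"1892", "year", 22, 26},
}

// fromOriginal cuts the selected words out of the original name-string
// by position and joins them with single spaces.
func fromOriginal(name string, words []word, keep map[string]bool) string {
	runes := []rune(name)
	var parts []string
	for _, w := range words {
		if !keep[w.Meaning] {
			continue
		}
		if w.Start < 0 || w.End > len(runes) || w.Start > w.End {
			continue // skip positions that do not fit the string
		}
		parts = append(parts, string(runes[w.Start:w.End]))
	}
	return strings.Join(parts, " ")
}

func main() {
	canonical := fromOriginal(sampleName, sampleWords,
		map[string]bool{"genus": true, "species": true})
	fmt.Println(canonical)
}
```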

Tutorials

Parsing names from CSV files tutorial

Installation

Compiled programs in Go are self-sufficient and small (GNparser is only a few megabytes). As a result the binary file of gnparser is all you need to make it work. You can install it by downloading the latest version of the binary for your operating system, and placing it in your PATH.

Install with Homebrew (Mac OS X, Linux)

Homebrew is a packaging system originally made for Mac OS X. You can now use it on Mac, Linux, or Windows with WSL (Windows Subsystem for Linux).

Install Homebrew according to their instructions.

Install gnparser with:

brew tap gnames/gn
brew install gnparser

Linux or Mac OS X

Move gnparser executable somewhere in your PATH (for example /usr/local/bin)

sudo mv path_to/gnparser /usr/local/bin

Windows

One possible way would be to create a default folder for executables and place gnparser there.

Press the Windows+R key combination and type "cmd". In the terminal window that appears, type:

mkdir C:\bin
copy path_to\gnparser.exe C:\bin

Add C:\bin directory to your PATH user and/or system environment variables.

It is also possible to install Windows Subsystem for Linux on Windows (v10 or v11), and use gnparser as a Linux executable.

Install with Go

If you have Go installed on your computer use

go install github.com/gnames/gnparser/gnparser@latest

For development install gnu make and use the following:

git clone https://github.com/gnames/gnparser.git
cd gnparser
make tools
make install

You do need your PATH to include $HOME/go/bin

Usage

Command Line

gnparser -f pretty "Quadrella steyermarkii (Standl.) Iltis & Cornejo"

Relevant flags:

--help -h : help information about flags.

--batch_size -b : Sets a maximum number of names collected into a batch before processing. This flag is ignored if parsing mode is set to streaming with -s flag.

--cultivar -C : Adds support for botanical cultivars like Sarracenia flava 'Maxima' and graft-chimaeras like + Crataegomespilus

--capitalize -c : Capitalizes the first letter of name-strings.

--details -d : Return more details for a parsed name. This flag is ignored for CSV/TSV formatting.

--diaereses -D : Preserves diaereses within names, e.g. Leptochloöpsis virgata. The stemmed canonical name will be generated without diaereses.

--format -f : output format. Can be csv, tsv, compact, pretty. Default is csv.

CSV and TSV formats return a header row and the CSV/TSV-compatible parsed result.

--jobs -j : number of jobs running concurrently.

--ignore_tags -i : keeps HTML entities and tags if they are present in a name-string. If your data is clean from HTML tags or entities, you can use this flag to increase performance.

--port -p : set a port to run web-interface and RESTful API.

--nsqd-tcp : requires --port. Redirects web-service log output to the TCP-based endpoint of an NSQ messaging server. It is handy for aggregating logs from GNparser web-services running inside Docker containers or in Kubernetes pods.

--species-group-cut : Changes stemmed canonical for autonym or species group names (e.g. Aus bus bus). It cuts infraspecific epithet, leaving only genus and specific epithet. All other data stays the same. This feature might be useful to match names like Aus bus and Aus bus bus.

--stream -s : GNparser can be used from any language using pipe-in/pipe-out of the command line application. This approach requires sending 1 name at a time to GNparser instead of sending names in batches. The streaming mode makes that possible.

--unordered -u : does not restore the order of output according to the order of input.

--version -V : shows the version number of GNparser.

--web-logs : requires --port. Enables output of logs for web-services.

To parse one name:

# CSV output (default)
gnparser "Parus major Linnaeus, 1788"
# or
gnparser -f csv "Parus major Linnaeus, 1788"

# TSV output
gnparser -f tsv "Parus major Linnaeus, 1788"

# JSON compact format
gnparser "Parus major Linnaeus, 1788" -f compact

# pretty format
gnparser -f pretty "Parus major Linnaeus, 1788"

# to parse a name from the standard input
echo "Parus major Linnaeus, 1788" | gnparser

# to parse a botanical cultivar name
gnparser "Anthurium 'Ace of Spades'" --cultivar
gnparser "Phyllostachys vivax cv aureocaulis" -C

# to parse a name that is all in lowercase
gnparser "parus major" --capitalize
gnparser "parus major" -c

To parse a file:

There is no flag for parsing a file. If parser finds the given file path on your computer, it will parse the content of the file, assuming that every line is a new scientific name. If the file path is not found, GNparser will try to parse the "path" as a scientific name.

Parsed results will stream to STDOUT, while progress of the parsing will be directed to STDERR.

# to parse with 200 parallel processes
gnparser -j 200 names.txt > names_parsed.csv

# to parse file with more detailed output
gnparser names.txt -d -f compact > names_parsed.txt

# to parse files using pipes
cat names.txt | gnparser -f csv -j 200 > names_parsed.csv

# to parse using `stream` method instead of `batch` method.
cat names.txt | gnparser -s > names_parsed.csv

# to not remove html tags and entities during parsing. You gain a bit of
# performance with this option if your data does not contain HTML tags or
# entities.
gnparser "<i>Pomatomus</i>&nbsp;<i>saltator</i>"
gnparser -i "<i>Pomatomus</i>&nbsp;<i>saltator</i>"
gnparser -i "Pomatomus saltator"

If jobs number is set to more than 1, parsing uses several concurrent processes. This approach increases speed of parsing on multi-CPU computers. The results are returned in some random order, and reassembled into the order of input transparently for a user.

Potentially the input file might contain millions of names, therefore creating one properly formatted JSON output might be prohibitively expensive. Therefore the parser creates one JSON line per name (when compact format is used)

You can use up to 20 times more "threads" than the number of your CPU cores to reach maximum speed of parsing (--jobs 200 flag). It is practical because additional "threads" are very cheap in Go and they try to fill out every idle gap in the CPU usage.

Pipes

Almost any language can use the pipes of the underlying operating system. From inside your program you can make the gnparser executable listen on a STDIN pipe and write its output to a STDOUT pipe. Here is an example in Ruby:

require 'open3'

# Starts one streaming gnparser process per output format and returns
# a hash with the stdin, stdout and stderr pipes of each process.
def self.start_gnparser
  io = {}

  ['compact', 'csv'].each do |format|
    stdin, stdout, stderr = Open3.popen3("./gnparser -s --format #{format}")
    io[format.to_sym] = { stdin: stdin, stdout: stdout, stderr: stderr }
  end

  io
end

@marcobrt kindly provided an example in PHP.

Note that you have to use --stream -s flag for this approach to work.
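
The same approach can be sketched in Go with os/exec. The snippet assumes that a gnparser binary is available in the PATH and that, in streaming mode with the compact format, each input line produces exactly one output line; both points are assumptions of this example. The stream type and its methods are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os/exec"
	"strings"
)

// stream wraps a long-running child process that answers one line of
// output for every line of input.
type stream struct {
	cmd *exec.Cmd
	in  io.WriteCloser
	out *bufio.Reader
}

// startStream launches the command and connects to its STDIN and STDOUT.
func startStream(name string, args ...string) (*stream, error) {
	cmd := exec.Command(name, args...)
	in, err := cmd.StdinPipe()
	if err != nil {
		return nil, err
	}
	out, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return &stream{cmd: cmd, in: in, out: bufio.NewReader(out)}, nil
}

// send writes one line to the process and reads one line back.
func (s *stream) send(line string) (string, error) {
	if _, err := io.WriteString(s.in, line+"\n"); err != nil {
		return "", err
	}
	res, err := s.out.ReadString('\n')
	return strings.TrimRight(res, "\r\n"), err
}

// close ends the input and waits for the process to exit.
func (s *stream) close() error {
	s.in.Close()
	return s.cmd.Wait()
}

func main() {
	if _, err := exec.LookPath("gnparser"); err != nil {
		fmt.Println("gnparser is not in PATH, nothing to do")
		return
	}
	s, err := startStream("gnparser", "-s", "-f", "compact")
	if err != nil {
		fmt.Println("cannot start gnparser:", err)
		return
	}
	defer s.close()

	res, err := s.send("Parus major Linnaeus, 1788")
	if err != nil {
		fmt.Println("cannot parse:", err)
		return
	}
	fmt.Println(res)
}
```

Keeping the process alive between calls avoids paying the start-up cost of the executable for every name.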

R language package

For the R language it is possible to use the rgnparser package. It implements the pipes method mentioned above and requires the gnparser app to be installed.

Ruby Gem

Ruby developers can use GNparser functionality via biodiversity gem. It uses C-binding and does not require an installed gnparser app.

Node.js

@tobymarsden created a wrapper for node.js. It uses C-binding and does not require an installed gnparser app.

Usage as a REST API Interface or Web-based User Graphical Interface

Web-based user interface and API are invoked by --port or -p flag. To start web server on http://0.0.0.0:9000

gnparser -p 9000

Opening this address in a browser will now show an interactive interface to the parser. API calls are accessible at http://0.0.0.0:9000/api/v1/.

The API and its schema are described fully using the OpenAPI specification.

Make sure to CGI-escape name-strings for GET requests. An '&' character needs to be converted to '%26'

GET /api?q=Aus+bus|Aus+bus+D.+%26+M.,+1870
POST /api with request body of JSON array of strings

require 'json'
require 'net/http'

uri = URI('https://parser.globalnames.org/api/v1/')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json',
                                   'accept' => 'json')
request.body = ['Solanum mariae Särkinen & S.Knapp',
                'Ahmadiago Vánky 2004'].to_json
response = http.request(request)

Enabling logs for GNparser's web-service

There are several ways to enable logging from a web service.

The following enables web-access logs to be printed to STDERR

gnparser -p 80 --web-logs

The next setting sends logs to an NSQ messaging service. This option allows logs from several instances of GNparser to be aggregated together. It is a great way to collect and analyze logs if the instances run inside Docker containers or as Kubernetes Pods.

gnparser -p 80 --nsqd-tcp=127.0.0.1:4150

An important note: the address must point to the TCP service of nsqd.

To enable logs to be sent to STDERR and NSQ run

gnparser -p 80 --web-logs --nsqd-tcp=127.0.0.1:4150

Use as a Docker image

You need to have docker runtime installed on your computer for these examples to work.

# run as a website and a RESTful service
docker run -p 0.0.0.0:80:8080 gnames/gognparser -p 8080 --web-logs

# just parse something
docker run gnames/gognparser "Amaurorhinus bewichianus (Wollaston,1860) (s.str.)"

Use as a library in Go

import (
  "fmt"

  "github.com/gnames/gnparser"
  "github.com/gnames/gnparser/ent/parsed"
)

func Example() {
  names := []string{"Pardosa moesta Banks, 1892", "Bubo bubo"}
  cfg := gnparser.NewConfig()
  gnp := gnparser.New(cfg)
  res := gnp.ParseNames(names)
  fmt.Println(res[0].Authorship.Normalized)
  fmt.Println(res[1].Canonical.Simple)
  fmt.Println(parsed.HeaderCSV(gnp.Format()))
  fmt.Println(res[0].Output(gnp.Format()))
  // Output:
  // Banks 1892
  // Bubo bubo
  // Id,Verbatim,Cardinality,CanonicalStem,CanonicalSimple,CanonicalFull,Authorship,Year,Quality
  // e2fdf10b-6a36-5cc7-b6ca-be4d3b34b21f,"Pardosa moesta Banks, 1892",2,Pardosa moest,Pardosa moesta,Pardosa moesta,Banks 1892,1892,1
}

Use as a shared C library

It is possible to bind GNparser functionality to languages that can use the C Application Binary Interface. Such languages include Python, Ruby, Rust, C, C++ and Java (via JNI).

To compile GNparser shared library for your platform/operating system of choice you need GNU make and GNU gcc compiler installed:

make clib
cd binding
cp libgnparser* /path/to/some/project

For an example of how to use the shared library, check this StackOverflow question and the biodiversity Ruby gem.

Parsing ambiguities

Some name-strings cannot be parsed unambiguously without some additional data.

Names with `filius` (ICN code)

For names like Aus bus Linn. f. cus the f. is ambiguous. It might mean that species were described by a son of (filius) Linn., or it might mean that cus is forma of bus. We provide a warning "Ambiguous f. (filius or forma)" for such cases.

Names with subgenus (ICZN code) and genus author (ICN code)

For names like Aus (Bus) L. or Aus (Bus) cus L. the (Bus) token would mean the name of subgenus for ICZN names, but for ICN names it would be an author of genus Aus. We created a list of ICN generic authors using data from IRMNG to distinguish such names from each other. For detected ICN names we provide a warning "Possible ICN author instead of subgenus".

Authors

Dmitry Mozzherin

Contributors

If you want to submit a bug or add a feature read CONTRIBUTING file.

References

Mozzherin, D.Y., Myltsev, A.A. & Patterson, D.J. “gnparser”: a powerful parser for scientific names based on Parsing Expression Grammar. BMC Bioinformatics 18, 279 (2017). https://doi.org/10.1186/s12859-017-1663-3

Rees, T. (compiler) (2019). The Interim Register of Marine and Nonmarine Genera. Available from http://www.irmng.org at VLIZ. Accessed 2019-04-10

License

Released under MIT license


gnparser's Issues

Misc inconsistencies output

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/17

Add 'ranked output' to gRPC

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/32

We had a one day internal hackathon and came up with a format that would be useful for taxonworks and probably for some other projects. It has more flattened format that is mostly ranks for genus, species, var, form, etc. Now I need to make a gRPC method for it and add this format to gnparser Ruby gem

natio rank parsed as epithet

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/45

Acipenser gueldenstaedti colchicus natio danubicus Movchan, 1967, a real name from WoRMS, is a quadrinomial of legacy rank natio.

GlobalNamesArchitecture/gnparser#413

Sanctioning authors not parsed

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/47

Fungal names often have a sanctioning author (Fr. or Pers.) following a colon after the basionym or combination authorship. This is currently unparsed.

Example: Boletus versicolor L. : Fr.
ICNfap Article 50: http://www.iapt-taxon.org/nomen/main.php?page=r50E&emph=sanctioned
https://en.wikipedia.org/wiki/Sanctioned_name

GlobalNamesArchitecture/gnparser#409

Implement gRPC service

Cover names where combination authorship is missing one parenthesis

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/40

Aus bus (Auth
Aus bus Auth)
Aus bus (Auth 1888 Auth2

Create outputs up to hybrids

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/5

Generate Abstract Syntax Tree that is compatible with output

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/1

As a User I want the output to be in a more logical order of items.

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/25

Decide what to do with names where species epithet has multiple dashes

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/49

Saccharomyces cerevisiae-agavica-sylvestre Carbajal, 1901

For now we treat them as unparseable tail

Support agamosp. agamossp. agamovar. ranks

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/27

As a developer I want to have tests for command line applet

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/42

There are many different functions now in CLI app and they all need to be tested.

Cover first 40 names from tests

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/3

"Pereskia subg. Maihuenia Philippi ex F.A.C.Weber, 1898" should be parsed with subg. rank

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/46

Pereskia subg. Maihuenia Philippi ex F.A.C.Weber, 1898 means that subgenus Maihuenia is included into
genus Pereskia.

We need to add subg. as a rank for non-species "binomials" and then the canonical form for this name
should be Maihuenia instead of Pereskia

GlobalNamesArchitecture/gnparser#456

Add html entities and tags normalization

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/34

Create a docker image and publish it for every new release

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/29

As a User I want to use pipes for parsing names

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/41

It is often useful in command line environment to chain different processes together with pipes. Currently parser supports only one name parsing, which is pretty useless. I want to be able to do something like:

gnparser -c durty_names.txt | gnparser -f pretty -j 300 > result.txt

As a gRPC User I want to get data back in the same order it is given if speed is sufficient

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/35

CamelCase normalize rule names

created by @mjy at https://gitlab.com/gogna/gnparser/-/issues/26

As a developer I want to see rule names in grammar.peg names the same way.

Add REST API and a web server

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/33

Port web server from Scala gnparser to Go gnparser

Output for hybrids

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/7

As a User I want to make sure that parser works reliably and creates expected results.

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/22

As a User I want to use html-cleaning and underscore substitution to spaces in my workflow as an option

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/37

I will experiment here and try to add underscore handling to the parser itself. If all goes well, there will be a cleaning task that will only work on removing html tags and html entities from names for now.

AST for open taxonomy

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/11

Author group should always be separated either by ``emend`` or by ``ex``

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/19

Add lookup for bacteria dictionaries

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/13

Syntax trees for hybrids

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/6

As a Developer I need clear instructions how to contribute to gnparser, and how to install development environment, or fork the project.

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/39

add debug output for development purposes

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/2

output up to open taxonomy

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/10

As a Developer I want to see every test from files as a separate test.

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/8

Currently test-data.txt file contains about 500 tests, but they show as one test. I think I can use table test framework from ginkgo to break them into separate tests.

Push creating syntax tree until hybrids

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/4

Write documentation in the code and README

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/30

Parse "Leptodactylus gr. wagneri" correctly

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/50

gr. indicates group (less than genus, more than species).

Currently canonical form we get is Leptodactylus gr. It should be Leptodactylus with wagneri placed
into group field. Not sure where to put this field, have to ask...

GlobalNamesArchitecture/gnparser#383

BOLD surrogates output

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/14

AST up to open taxonomy

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/9

Remove multiple years for simplicity sake

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/23

It is very rare when a name-string has more than 1 year. I am going to remove multiple year output

output for viruses

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/15

Set continuous integration

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/31

As a User I want flexibility in docker image, so I can use it either as a REST service, or gRPC service

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/38

output for open taxonomy

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/12

As a User I want to see correct parsing of hybrid notho- ranks

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/43

As a Developer I want to have a tool that allows me to rebuild test file completely.

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/21

It is useful if we need to change output format, or find a bug that affects many tests.

Add newick format normalization (changing underscores to spaces)

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/36

Output for misc years

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/18

Follow transliteration of non-ascii characters as well as reasonably possible

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/48

ICN
60.6. Diacritical signs are not used in scientific names. When names (either new or old) are drawn from words in which such signs appear, the signs are to be suppressed with the necessary transcription of the letters so modified; for example ä, ö, ü become, respectively, ae, oe, ue; é, è, ê become e; ñ becomes n; ø becomes oe; å becomes ao. The diaeresis, indicating that a vowel is to be pronounced separately from the preceding vowel (as in Cephaëlis, Isoëtes), is a phonetic device that is not considered to alter the spelling; as such, its use is optional. The ligatures -æ- and -œ-, indicating that the letters are pronounced together, are to be replaced by the separate letters -ae- and -oe-.

ICZN
32.5.2.1. In the case of a diacritic or other mark, the mark concerned is deleted, except that in a name published before 1985 and based upon a German word, the umlaut sign is deleted from a vowel and the letter "e" is to be inserted after that vowel (if there is any doubt that the name is based upon a German word, it is to be so treated).

We need to decide on a 'less harmful' approach here.

Add explanation for canonical form different uses for 'simple' and 'full' canonical forms.

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/44

Make all tests pass

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/20

We are skipping html entities for now, will address them later.
