Comments (28)

joelnitta commented on June 16, 2024

+1: I am also very interested in this feature (being able to parse and resolve names with a CLI using a local database). I am working on an R package, taxastand, that allows the user to provide their own reference taxonomy in DWC format (e.g., as a CSV file) to resolve against. I think having complete control over the reference taxonomy and search algorithm is key for reproducibility (though priority of this may vary according to user).

My package currently uses the taxon-tools CLI for parsing and matching, but it would be great if gnames were also available for this, as it seems very fast and accurate.

from gnames.

dimus commented on June 16, 2024

@joelnitta it is definitely in the plans, but not the highest priority. Check out https://github.com/globalbioticinteractions/nomer by @jhpoelen

dimus commented on June 16, 2024

Made a ticket #75 to spell out why something percolated up to be "bestResult"

dimus commented on June 16, 2024

Using feedback from GlobalNamesArchitecture/dwca_hunter#30 by @jhpoelen

I'd be happy to help test your tool, especially if the features align with those of Nomer:

easy to install (in my opinion, docker is not easy for most and usually just hides overly complex infrastructures)

It will be Docker-based, at least in the first iteration, because that fits a Kubernetes workflow. We can think about other ways once the Docker one is implemented.

able to index, cache and version existing taxon lists (e.g., index Plazi names from taxon lists published in DwC-A)

Not in the scope currently, would require more thought.

able to batch export / stream link results

yes

command-line interface

The https://github.com/gnames/gnmatcher component takes about a minute to initialize. If it runs as a service, the rest can be set up as a CLI program.

able to run on single laptop.

yes, for modern laptops. I am playing with the idea of setting a high-end 64-bit Raspberry Pi as the lowest common denominator we support.

designed for offline (no internet) workflows

yes

separate commands to do common name operations: find, parse, resolve

gnfinder, gnparser and gnames are 3 different projects that do find, parse and resolve, respectively.

Is this in line with the features you are planning to support?

Some of them yes, others need more thought

mjy commented on June 16, 2024

@jhpoelen Thinking out loud. "Use locally" has design implications that likely need to be teased out. I think what you define is not really "use locally", it's a feature set. If we look at all parts of the spectrum, your old laptop, vs. a supercomputer, vs. a mid range kubernetes cluster each would have some dis/advantages, and each is "local" in some context.

If you mean "I want to do all these things on my 3 year old laptop" then you're going to constrain certain approaches (e.g. perhaps no docker).

I'm also not sure about your argument vs. docker (not specifically, just for/against some tech). Let's imagine a completely naive user. Now we ask them to do some things to "run locally". One might argue that installing a package on Mac, then running a single docker command is more straightforward than installing binaries, updating paths, and typing multiple commands.

I.e., I'm largely on board with the sentiment, but the devil is in the details.

jhpoelen commented on June 16, 2024

hey @dimus @mjy thanks for taking the time to consider my situation: over the years, I have spent a large chunk of my time maintaining name parsing, resolving and linking software libraries. Every once in a while, I shop around for existing name tools to re-use. So far, I've incorporated the Scala-based gnparser library and your online resolver.globalnames.org service, but I am eager to re-use more of your specialty tools to find, parse and link taxonomic names.

From @dimus 's answers, it looks like the tools gnfinder, gnparser and gnmatcher (not gnames right?) satisfy my need to have simple, easy to install command-line tools that do one thing well. The only piece that seems to be missing is the ability to locally install and specify the versions of name lists that gnmatch resolves against.

@mjy I much agree that the phrase "use locally" can be interpreted in many ways, especially when taken out of this context. Just to clarify: by use locally, I mean to build tools in the unix tradition. The additional (almost implicit) requirement is to design tools to be used offline.

As for my old laptop: as far as I can tell, not much has changed in hardware in the last 10 years, aside from maybe the increased performance of SSD/NVMe storage, which allows for parallel computation workflows. So, when I say "it should run on my old laptop", I mean that the tools should be able to run on a run-of-the-mill Linux machine with modest memory requirements.

Also, to me, Docker is a way to package a suite of tools to help make it easier to use standard Linux features (e.g., cgroups, Linux namespaces, copy-on-write) to package, publish and deploy a compilation of programs and configurations. I can see how this can be useful for installing tools with (overly?) complex dependencies and/or for having an easy way to deploy to tens or hundreds of servers. In this discussion, I'd argue that Docker is out of scope, because Docker is a complementary method to alleviate (overly complex?) install procedures.

I am looking forward to figuring out ways to re-use your tools in my workflows, especially for the gnmatch tool: I'd very much like to be able to run a resolver.globalnames.org -like service locally with defined and versioned taxonomy name providers to match against.

Thanks again for taking the time to respond and please let me know if there's any specific information you need to clarify my use case.

dimus commented on June 16, 2024

@jhpoelen here are command line programs that can be used for finding, parsing and verification

parsing: https://gitlab.com/gogna/gnparser

finding: https://github.com/gnames/gnfinder

verification: https://github.com/gnames/gnverify

The gnverify CLI currently works via remote access to the https://index.globalnames.org service; however, that is going to change soon, and it will use https://github.com/gnames/gnames instead. I suspect you will be able to use local version of gnames for verification, but it does need to operate with pretty large data, so it will be demanding on memory, cpu and disk space.

gnfinder also uses index.globalnames.org for verification, and will also be moved to gnames for this task.

I will probably continue to keep https://resolver.globalnames.org online for a few years, but eventually I do plan to make it an alias for the future gnames service.

gnmatcher is rather an internal library (and has Scala and Go implementations).

jhpoelen commented on June 16, 2024

I suspect you will be able to use local version of gnames for verification, but it does need to operate with pretty large data, so it will be demanding on memory, cpu and disk space.

Assuming that there are different taxonomic name sources, would it be possible to install just the small subset of the data needed to run the verification (same as linking, right?). For instance, I'd be interested in only verifying against ITIS, NCBI and WoRMS.

dimus commented on June 16, 2024

Yes, if you set a database locally, you can purge all resources you do not care about. In that case gnmatcher would also have a somewhat smaller memory footprint, as you would have a slightly smaller number of canonical forms to match against.

dimus commented on June 16, 2024

There is no reason for docker for gnfinder and gnparser. You can also install all components of gnames by hand, but I personally think docker compose makes it easier.

I do think that docker does help because of the tools it provides. Surely a similar effect can be achieved using standard Linux features, but that requires more knowledge from users and is a less portable solution. I strive for the only requirement of users to be the ability to download a program, open some kind of command line interface on Windows, Mac or Linux, and run a command. Our most precious users are biologists, not computer specialists, most of the time.

dimus commented on June 16, 2024

I am looking forward to figuring out ways to re-use your tools in my workflows, especially for the gnmatch tool: I'd very much like to be able to run a resolver.globalnames.org -like service locally with defined and versioned taxonomy name providers to match against.

I think gnames is getting close to being usable. So let's try to make it work locally on your machine when it is ready.

jhpoelen commented on June 16, 2024

@dimus I am eager to experiment with gnames !

jhpoelen commented on June 16, 2024

Yes, if you set a database locally, you can purge all resources you do not care about.

PS I'd assume that I would be able to install my own taxonomic name resources, instead of having to install some default database snapshot.

Something like:

# remove all local name resources
gnames clear
# expect no links, because no resources are available
gnames verify "Homo sapiens"

# import ITIS names
gnames import https://example.org/itis.tar.gz
# now the name is verified against ITIS as expected
gnames verify "Homo sapiens"

dimus commented on June 16, 2024

PS I'd assume that I would be able to install my own taxonomic name resources, instead of having to install some default database snapshot.

It would be great, but it is low in the priority queue.

jhpoelen commented on June 16, 2024

@dimus thanks for listening and responding. Just curious - how do you build the name database yourself?

dimus commented on June 16, 2024

@dimus thanks for listening and responding. Just curious - how do you build the name database yourself?

For now, by hand with https://github.com/GlobalNamesArchitecture/dwca-hunter. I did include a harvester in the old grant, but more pressing projects got in the way, so it is still on the back burner.

jhpoelen commented on June 16, 2024

@dimus thanks for sharing: I poked around in the source code. Very neat tool! This tool produces dwca with taxa tables from various different file formats right? What process/ schema do you use to load these dwca taxa tables in a database?

dimus commented on June 16, 2024

@jhpoelen right, the idea I had was to normalize data for ingestion. It does help to get data in, but I think it can be more automatic. I am toying with the idea of skipping the database altogether and just keeping everything as a bunch of files from which to make gnames. But that's in the future; first I need to make gnames work correctly, hopefully with a throughput of 5k names a second or so.

dimus commented on June 16, 2024

I imagine gndiff working like a tiny gnames that has only one data-source completely in memory. I imagine it would work like this:

  1. It takes 2 datasets (I would imagine they are usually less than 100k names?)
  2. Determine which of the datasets is smaller, and create matching/fuzzy matching data-structures in memory.
  3. Run larger dataset against the smaller one using pretty much the same algorithms as gnames/gnmatcher, all in memory.
  4. Spit matching results to STDOUT
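A minimal sketch of the four steps above in Python, not gndiff's actual implementation. Everything here is an assumption made for illustration: `canonical` is a crude stand-in for a real name parser such as gnparser, `diff_names` is a hypothetical function name, and the standard library's `difflib` substitutes for whatever fuzzy-matching algorithm gnames/gnmatcher uses.

```python
import difflib


def canonical(name):
    """Crude stand-in for a real parser: keep the first two words, lowercased."""
    return " ".join(name.split()[:2]).lower()


def diff_names(list_a, list_b, cutoff=0.9):
    """Yield (query_name, match_type, matched_reference_names) tuples."""
    # Step 2: the smaller list becomes the in-memory reference.
    if len(list_a) <= len(list_b):
        reference, query = list_a, list_b
    else:
        reference, query = list_b, list_a

    # Index the reference by canonical form.
    index = {}
    for name in reference:
        index.setdefault(canonical(name), []).append(name)
    keys = list(index)

    # Step 3: run the larger list against the index, one name at a time.
    for name in query:
        key = canonical(name)
        if key in index:
            yield name, "exact", index[key]
            continue
        # Fuzzy fallback (hypothetical choice): closest canonical form, if any.
        close = difflib.get_close_matches(key, keys, n=1, cutoff=cutoff)
        if close:
            yield name, "fuzzy", index[close[0]]
        else:
            yield name, "none", []


if __name__ == "__main__":
    ref = ["Trichomanes bifidum Willd.", "Homo sapiens Linnaeus, 1758"]
    qry = ["Homo sapiens", "Trichomanes bifidium", "Pinus alba"]
    # Step 4: write tab-separated results to STDOUT.
    for query_name, match_type, matches in diff_names(ref, qry):
        print(query_name, match_type, "; ".join(matches), sep="\t")
```

Note that `difflib.get_close_matches` scans every reference key for each unmatched query name, so this sketch would be far too slow for lists of the size discussed here; it only illustrates the data flow.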

dimus commented on June 16, 2024

About something completely different: I found that Trichomanes bifidum Willd does not return the correct record in the current version of the verifier, but works correctly with the new one.

OLD:

https://verifier.globalnames.org/?all_matches=&all_sources=&capitalize=on&format=html&names=Trichomanes+bifidum+Willd

NEW: (no GUI yet)

http://verifier.globalnames.org/api/v0/verifications/Trichomanes+bifidum+Willd

It happens because of a bug in gnparser that is now fixed (gnames/gnparser#212). Gives me an additional incentive to move faster with the beta version of verification :)

joelnitta commented on June 16, 2024

That sketch sounds mostly good to me.

(I would imagine they are usually less than 100k names?)

My particular use-case (pteridophytes) includes ~65k names in the reference. I could imagine larger groups (e.g., angiosperms, all plants, etc) requiring a larger reference. But I'm not familiar with the memory requirements of the search algorithm... could there just be an option to request more memory / cores for larger datasets?

Determine which of the datasets is smaller

Is that because using the smaller dataset as the target is more memory efficient? From a user's viewpoint, it doesn't matter which is treated as target/query by the algorithm as long as the user can see what names in the reference database match the query.

dimus commented on June 16, 2024

For 33 million names it takes about 6 GB on my laptop. So I would imagine that even a million names should be OK for most computers.
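As a back-of-the-envelope check of that estimate, here is the arithmetic in Python. It assumes memory scales roughly linearly with the number of names, which is an assumption for illustration and not something measured in this thread; `estimate_memory_gb` is a hypothetical helper.

```python
GB = 10**9  # decimal gigabytes, for a rough estimate only


def estimate_memory_gb(n_names, ref_names=33_000_000, ref_gb=6.0):
    """Scale the reported figure (6 GB for 33 million names) linearly."""
    bytes_per_name = ref_gb * GB / ref_names
    return n_names * bytes_per_name / GB


if __name__ == "__main__":
    # Roughly 182 bytes per name, so about 0.18 GB for one million names.
    print(round(6.0 * GB / 33_000_000), "bytes per name")
    print(round(estimate_memory_gb(1_000_000), 2), "GB for 1 million names")
```

The real footprint depends on the matching data structures, so treat this as an order-of-magnitude guide only.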

The smaller dataset has to have all of its matching data structures created in memory, while the larger dataset is just processed one name after another and matched against them. So the larger dataset can actually be huge and not affect memory consumption.

dimus commented on June 16, 2024

The time scale for gndiff is weeks, while for the registry it is months/years. So I think I can play with the gndiff idea in the not-so-distant future.

joelnitta commented on June 16, 2024

Great! IMHO there should be demand for this (besides myself) as I think it will greatly improve reproducibility if the user is not limited to selecting from an online database, which could be updated outside of their control at any time. And the memory requirements sound like it should be fine for a given custom database (not trying to match across GBIF, CoL, etc).

RE: Trichomanes bifidum Willd query example... if I'm reading the output correctly, it looks like it matched on the exact canonical form (Trichomanes bifidum) in either case? So I don't see how it could differentiate between Trichomanes bifidum C. Presl vs. Trichomanes bifidum Willd.

dimus commented on June 16, 2024

RE: Trichomanes bifidum Willd query example... if I'm reading the output correctly, it looks like it matched on the exact canonical form (Trichomanes bifidum) in either case? So I don't see how it could differentiate between Trichomanes bifidum C. Presl vs. Trichomanes bifidum Willd.

In the GUI it matched with C. Presl as the best match, while in the beta version it matched correctly with Willd as the best match.
Verification first matches on the canonical form, then sorts the results according to finer details like authors, ranks, quality of parsing etc.

The GUI uses /api/v1, and gnparser there has a bug that does not add ex authors to the authors list, while showing them in the details section. Beta /api/v0 uses the most recent gnparser, and Willd got into the authors list.
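A minimal sketch of that two-stage idea (canonical match first, then ranking by authorship), using the two names from this thread. It is not the verifier's actual scoring: `split_name` is a crude stand-in for gnparser, `best_match` is a hypothetical function, and counting shared author tokens is an assumed, simplified ranking rule.

```python
import re


def split_name(name):
    """Crude stand-in for gnparser: first two words are the canonical form,
    every remaining alphabetic token (except 'ex') is treated as an author."""
    words = name.split()
    canonical = " ".join(words[:2])
    tokens = re.findall(r"[^\W\d_]+", " ".join(words[2:]))
    authors = {t.lower() for t in tokens if t.lower() != "ex"}
    return canonical, authors


def best_match(query, candidates):
    q_canonical, q_authors = split_name(query)
    # Stage 1: keep only candidates with the same canonical form.
    matched = [c for c in candidates if split_name(c)[0] == q_canonical]
    # Stage 2: prefer the candidate sharing the most author tokens with the
    # query; ties keep the original candidate order.
    return max(
        matched,
        key=lambda c: len(q_authors & split_name(c)[1]),
        default=None,
    )


if __name__ == "__main__":
    candidates = [
        "Trichomanes bifidum C. Presl",
        "Trichomanes bifidum Vent. ex Willd.",
    ]
    print(best_match("Trichomanes bifidum Willd", candidates))
```

If the ex authors are left out of the author set, as in the bug described above, both candidates tie on authorship and the first one listed wins, which mirrors the wrong best match seen in the GUI.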

joelnitta commented on June 16, 2024

Ah I see, thanks for the explanation. So "matchType":"Exact" and "matchedName":"Trichomanes bifidum Vent. ex Willd." in http://verifier.globalnames.org/api/v0/verifications/Trichomanes+bifidum+Willd is hiding a bit of detail about how the match was made. (Glad you got the ex bug sorted... unfortunate difference there between botanical and zoological practices!)

dimus commented on June 16, 2024

@joelnitta check out https://github.com/gnames/gndiff, it is a tool I started for comparing 2 name lists against each other. It is not optimized yet (minimally viable product), and the direction of its development would depend on feedback and accumulated use-cases.

I did try to compare 300k names against 1 million names; it survived the load, finished the job in 2 min and ate 6 GB of memory in the process.

Please add your feedback, thoughts and feature requests as issues at https://github.com/gnames/gndiff, if you have any.

joelnitta commented on June 16, 2024

Fantastic, thanks so much! I will definitely check it out.
