Synopsis

Scientific names are critical metadata elements in biodiversity. They are the scaffolding upon which all biological information hangs. However, scientific names are imperfect identifiers. Some taxa share the same name (e.g. homonyms across nomenclature codes) and there can be many names for the same taxon. Names change because of taxonomic and nomenclatural revisions and they can be persistently misspelled in the literature. Optical scanning of printed material compounds the problem by introducing greater uncertainty in data integration.

This verification service tries to answer the following questions about a string representing a scientific name:

  • Is this a name?
  • Is it spelled correctly?
  • Is this name currently in use?
  • What other names are related to this name (e.g. synonyms, lexical variants)?
  • If this name is a homonym, which is the correct one?

Matching Process

1. Exact Matching

Submitted names are parsed first, and their canonical forms are checked for exact matches against names in the entire verifier database. An algorithm then sorts the matches according to the scoring algorithm described below and returns the best match.

Canonical forms

Name strings are often supplied with complex authorship information [e.g. Racomitrium canescens f. epilosum (H. Müll. ex Milde) G. Jones in Grout]. The Global Name parser strips authorship and rank information from names [e.g. Racomitrium canescens epilosum], which makes it possible to compare the string with other variants of the same name. The resulting canonical forms are checked for exact matches against canonical forms in specified data sources or in the entire verifier database. All names found at this step are removed from further processing.

The GNparser program performs all the parsing steps.
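The idea of a canonical form can be illustrated with a deliberately naive sketch. This is not how GNparser works; the function name `toy_canonical` and the list of rank markers are assumptions made for illustration, and the sketch simply keeps the genus plus the lowercase epithets that precede the first authorship-like token. It does not handle hyphenated epithets, hybrids, or authorship placed between epithets.

```python
# Rank markers to skip; an illustrative, incomplete list (assumption).
RANK_MARKERS = {"subsp.", "ssp.", "var.", "f.", "subvar.", "subf."}


def toy_canonical(name: str) -> str:
    """Return a simplified canonical form: genus plus lowercase epithets."""
    tokens = name.split()
    if not tokens:
        return ""
    words = [tokens[0]]  # the first token is taken as the genus
    for token in tokens[1:]:
        if token in RANK_MARKERS:
            continue  # drop rank information
        if token.isalpha() and token.islower():
            words.append(token)  # plain lowercase word: treat as an epithet
        else:
            break  # anything else is assumed to start the authorship
    return " ".join(words)


name = "Racomitrium canescens f. epilosum (H. Müll. ex Milde) G. Jones in Grout"
print(toy_canonical(name))  # Racomitrium canescens epilosum
```

A real parser needs a full grammar for scientific names, which is why the service delegates this step to GNparser.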

2. Fuzzy Matching of Canonical Forms

Mistakes, misspellings, or OCR errors can create incorrect variants of scientific names. Canonical forms left unmatched by the previous step are fuzzily matched against canonical forms in specified data sources. We use a modified version of the TaxaMatch algorithm developed by Tony Rees. After this step, all found names are removed from the process.
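As a rough illustration of fuzzy matching of canonical forms, the sketch below uses `difflib.get_close_matches` from the Python standard library as a stand-in. It is not TaxaMatch and does not reproduce its rules; the function name `fuzzy_candidates`, the similarity cutoff, and the sample names are assumptions.

```python
from difflib import get_close_matches


def fuzzy_candidates(canonical, known_canonicals, cutoff=0.9):
    """Return up to three known canonical forms similar to the input.

    Similarity here is difflib's ratio, used only as a stand-in for
    the modified TaxaMatch algorithm mentioned in the text.
    """
    return get_close_matches(canonical, known_canonicals, n=3, cutoff=cutoff)


known = ["Pardosa moesta", "Pomatomus saltatrix"]
print(fuzzy_candidates("Pardosa moestta", known))  # ['Pardosa moesta']
print(fuzzy_candidates("Quercus alba", known))     # []
```

A high cutoff keeps false positives down, which matters when the reference list holds millions of similar-looking names.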

3. Partial Exact Matching of Names

Some infraspecific names do not match anything in the verification database. Sometimes this happens because the name does not exist in the collected data. Sometimes a 'junk' word is wrongly included and the parser recognizes it as an infraspecific epithet. Sometimes an infraspecies is "promoted" to species and the middle word disappears. The algorithm removes middle or terminal words and tries to match the resulting canonical forms. For example, the last word of "Pardosa moesta spider" would be ignored, giving a match to "Pardosa moesta".
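The word-removal step can be sketched as follows. The function names, the order in which shortened forms are tried (terminal word first, then middle words), and the in-memory set of known names are assumptions made for illustration; the source does not specify them.

```python
def partial_candidates(canonical):
    """List shortened forms of a name with three or more words."""
    words = canonical.split()
    candidates = []
    if len(words) >= 3:
        # Drop the terminal word, e.g. a junk word parsed as an epithet.
        candidates.append(" ".join(words[:-1]))
        # Drop each middle word in turn, e.g. after promotion to species.
        for i in range(1, len(words) - 1):
            candidates.append(" ".join(words[:i] + words[i + 1:]))
    return candidates


def partial_exact_match(canonical, known):
    """Return the first shortened form found in `known`, or None."""
    for candidate in partial_candidates(canonical):
        if candidate in known:
            return candidate
    return None


print(partial_candidates("Pardosa moesta spider"))
# ['Pardosa moesta', 'Pardosa spider']
print(partial_exact_match("Pardosa moesta spider", {"Pardosa moesta"}))
# Pardosa moesta
```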

4. Fuzzy Partial Matching

If exact partial matching fails, we try to make an approximate match of the shortened canonical forms.

5. Exact Matching of a Genus Part

If everything else fails, we try to match the apparent genus of the input.

Scoring algorithm

More often than not, verification returns more than one result. On some occasions there might be thousands of matching names. We decided to return only one "best" result, while still making it possible to get data from the data sources a user is interested in. The algorithm uses the following criteria for sorting the results:

Infraspecific ranks

The botanical nomenclatural code allows a variety of ranks in infraspecific names. The algorithm favors results that contain the same rank as the input name.

Edit distance

When results are "fuzzy-matched", the algorithm favors matches with the smallest edit distance, as determined by the Levenshtein algorithm.
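For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a textbook dynamic-programming version, not the service's own code; the sample names are made up for illustration.

```python
def levenshtein(a, b):
    """Compute the Levenshtein edit distance between two strings."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


query = "Pardosa moestta"
matches = ["Pardosa modesta", "Pardosa moesta"]
# Smaller distance sorts first.
print(sorted(matches, key=lambda m: levenshtein(query, m)))
# ['Pardosa moesta', 'Pardosa modesta']
```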

Data source curation

The algorithm favors data sources known for significant curatorial effort over those that are not curated or whose curation effort is unknown.

Authorship

For inputs that contain authorship, the algorithm favors matches that contain the same or similar authorship.

Current acceptance of a name

A result is favored over other results if it is a currently accepted name rather than a synonym or a misspelling.

Parsing quality

GNparser returns a parsing quality value after extracting a canonical form. The algorithm favors results with higher parsing quality over those with lower quality.
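One simple way to combine criteria like these is to sort candidates by a tuple key. The sketch below is only an illustration: the `Candidate` fields, the priority order of the criteria, and the convention that a smaller `parse_quality` number means cleaner parsing are all assumptions, not the service's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one matched result."""
    name: str
    same_rank: bool         # same infraspecific rank as the input
    edit_distance: int      # 0 for exact matches
    curated: bool           # comes from a curated data source
    authorship_match: bool  # same or similar authorship
    accepted: bool          # currently accepted name
    parse_quality: int      # assumption: smaller means cleaner parsing


def sort_key(c):
    # False sorts before True, so favored properties are negated.
    # The order of the criteria is an assumption.
    return (
        not c.same_rank,
        c.edit_distance,
        not c.curated,
        not c.authorship_match,
        not c.accepted,
        c.parse_quality,
    )


def best_result(candidates):
    """Return the candidate that sorts first under sort_key."""
    return min(candidates, key=sort_key)


candidates = [
    Candidate("Pardosa moesta", True, 0, False, True, True, 1),
    Candidate("Pardosa moesta Banks, 1892", True, 0, True, True, True, 1),
    Candidate("Pardosa modesta", True, 1, True, False, False, 1),
]
print(best_result(candidates).name)  # Pardosa moesta Banks, 1892
```

A lexicographic key makes earlier criteria strictly dominate later ones; a weighted score would be an alternative if criteria should trade off against each other.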

Preferred data sources

Sometimes a user is more interested in getting results from a particular data source than in a "best result". For such cases there is an option to always return data from that data source. It is also possible to ignore the "best result" completely. This can be useful when a user wants to map a checklist to a particular data source.
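The selection logic described here might look like the following sketch. The function `select_results`, its parameters, the data-source labels, and the choice to keep one top-ranked result per preferred source are hypothetical; the actual service options may differ.

```python
def select_results(ranked, preferred_sources, with_best=True):
    """Pick the overall best result and one result per preferred source.

    `ranked` is a list of (data_source, name) pairs, best first.
    """
    best = ranked[0] if (with_best and ranked) else None
    preferred = {}
    for source, name in ranked:
        # Keep only the top-ranked result from each preferred source.
        if source in preferred_sources and source not in preferred:
            preferred[source] = name
    return best, preferred


ranked = [
    ("source-a", "Pardosa moesta"),
    ("source-b", "Pardosa moesta Banks, 1892"),
    ("source-b", "Pardosa modesta"),
]
print(select_results(ranked, {"source-b"}))
# (('source-a', 'Pardosa moesta'), {'source-b': 'Pardosa moesta Banks, 1892'})
print(select_results(ranked, {"source-b"}, with_best=False))
# (None, {'source-b': 'Pardosa moesta Banks, 1892'})
```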

Code on GitHub