Taxonomic validation process

This document provides an overview of the validation process used for GBIF and NCBI taxa.

GBIF validation

GBIF validation searches the GBIF backbone taxonomy database for a taxon that matches both the taxon scientific name and taxonomic rank provided in a dataset. If a match is found then the GBIF database will supply one of the following status codes:

accepted
doubtful 
homotypic synonym
synonym
heterotypic synonym
proparte synonym
misapplied

The first two options (accepted and doubtful) are treated as canonical matches, but the remaining status codes will provide a link to an accepted taxonomic usage. The validation process therefore extracts three things from the backbone database:

The status of the name and rank provided by the user.
The accepted usage (which might be the same).
The backbone taxonomic hierarchy for the taxa, so we can index at higher taxonomic levels.

However, this is complicated as there is no guarantee that taxa will always hook into the backbone at the next deeper taxonomic level. For example, the accepted taxon Wanosuchus atresus is only hooked in at order level: it is accepted as a species, but its parent is Crocodylia as the genus is doubtful. Similarly, Goniopholis tenuidens is a synonym at species level but again has Crocodylia as a parent (and is considered a synonym for the family Goniopholidae).

In some cases, taxa can hook in at taxonomic levels more nested than their own: the species Brittonastrum greenei is a synonym of the subspecies Agastache pallidiflora pallidiflora.

The parent_taxon_level_analysis.R file in this repository contains some code to check this:

All accepted taxa map to a more nested parent but 5% map to a more nested parent more than one step up the hierarchy. The table below shows child taxon level as rows and parent taxon level as columns.

	kingdom	phylum	class	order	family	genus	species	subspecies	variety
kingdom	0	0	0	0	0	0	0	0	0
phylum	100	0	0	0	0	0	0	0	0
class	5	316	0	0	0	0	0	0	0
order	7	45	1327	0	0	0	0	0	0
family	2191	1339	4267	14423	0	0	0	0	0
genus	3427	4985	5584	6260	220735	0	0	0	0
species	1567	706	1529	696	8944	2449414	0	0	0
subspecies	41	7	3	2	832	268	200902	0	0
variety	53	10	0	26	2661	50	82914	32	0
form	12	4	0	4	815	18	19272	0	56

Only 77% of unaccepted taxa map to a parent at the next most nested taxonomic level and 4.5% map to a parent at the same or a less nested level, as in the example above.

	kingdom	phylum	class	order	family	genus	species	subspecies	variety
kingdom	0	0	0	0	0	0	0	0	0
phylum	22	8	0	0	0	0	0	0	0
class	0	14	1	0	0	0	0	0	0
order	0	5	32	0	0	0	0	0	0
family	21	157	481	3599	0	0	0	0	0
genus	8555	24242	25055	31010	185911	0	0	0	0
species	64	24	173	405	2142	1886329	121225	84	5
subspecies	3	0	1	0	151	77512	26266	13	0
variety	2	1	0	2	367	212954	50062	47	4
form	0	0	0	0	128	48126	10449	3	2

GBIF validation process

Validation against the local GBIF database works by using an initial SQL query using the provided name and rank:

SELECT * FROM backbone
    WHERE canonicalName = 'XX'
    AND taxonRank = 'YY';

This will return all exact matches for the combination, which can include rows for multiple taxa with different description dates and authors, references and taxonomic statuses. The safedata_validator package examines the returned rows and tries to identify a single accepted, doubtful or non-canonical usage in that order of preference.

The backbone table row for the identified taxon provides both acceptedNameUsageID and parentNameUsageID fields that can be easily used in subsequent searches to get the accepted name and hook the taxon into the higher taxonomy. Although canonical higher taxon names are provided, their taxon IDs are not, so higher taxa are saved into a set containing unique pairs of names and ranks and then added to the index when all taxa have been processed.

Problems

Rare edge cases include taxon names with two equally approved usages: for example, the genus Morus is an accepted usage for both mulberries and gannets. This kind of problem is described in the JSON "note" field and provided to the user. In these rare cases, an accepted usage would require a GBIF taxon ID to be provided to discriminate between them.

NCBI validation

The basic idea is to search taxon names in the NCBI database to look for matches, if a match is found then the supplied rank and (optionally) NCBI taxa ID are used to verify whether that this match is as expected. When searching taxa names that are no longer considered to be the proper scientific name for a taxon, the validation automatically maps onto the new taxonomic name. In cases where a two taxon are considered to be equivalent NCBI removes the superseded taxon, and then records the ID of this taxon in a separate table. This means that our validation method can account for the entry of outdated taxa (and their IDs), which pass validation with a warning being provided to the user.

We want three things from validation:

To know whether the name provided by the user refers to an existing taxon.
To know whether or not this taxon is superseded.
The backbone taxonomic hierarchy for the taxa, so we can index at higher taxonomic levels. Taxa do not always neatly hook into the backbone taxonomic rank immediately above. However, this isn't too much of a problem in the NCBI case as parent IDs are always supplied so a full taxon hierarchy can always be constructed, which can then be pruned to form a backbone hierarchy.

Unlike NCBI does not provide status codes to taxa. As describes above it just merges outdated taxa names and taxa IDs with up to date names and IDs. Our validation proceed makes a note of where this happens. However, we do record status codes so that GBIF and NCBI taxonomic information can be stored in the same format. The three status codes we define are as follows:

accepted
merged
user

The first of these (accepted) is for taxa found using a currently valid name or ID, the second of these (merged) is for valid taxa found using either a previously valid name or ID that has been merged with another name ID. Finally, 'user' is used for taxa that are not found in the NCBI database but have valid parent information. This is particularly useful for potentially novel taxa, i.e. ones defined by the user.