Skip to content

Taxonomic validation process

This document provides an overview of the validation process used for GBIF taxa.

GBIF validation

GBIF validation searches the GBIF backbone taxonomy database for a taxon that matches both the taxon scientific name and taxonomic rank provided in a dataset. If a match is found then the GBIF database will supply one of the following status codes:

accepted
doubtful 
homotypic synonym
synonym
heterotypic synonym
proparte synonym
misapplied

The first two options (accepted and doubtful) are treated as canonical matches, but the remaining status codes will provide a link to an accepted taxonomic usage. The validation process therefore extracts three things from the backbone database:

  1. The status of the name and rank provided by the user.
  2. The accepted usage (which might be the same).
  3. The backbone taxonomic hierarchy for the taxa, so we can index at higher taxonomic levels.

However, this is complicated as there is no guarantee that taxa will always hook into the backbone at the next deeper taxonomic level. For example, the accepted taxon Wanosuchus atresus is only hooked in at order level: it is accepted as a species, but its parent is Crocodylia as the genus is doubtful. Similarly, Goniopholis tenuidens is a synonym at species level but again has Crocodylia as a parent (and is considered a synonym for the family Goniopholidae).

In some cases, taxa can hook in at taxonomic levels more nested than their own: the species Brittonastrum greenei is a synonym of the subspecies Agastache pallidiflora pallidiflora.

The parent_taxon_level_analysis.R file in this repository contains some code to check this:

  • All accepted taxa map to a more nested parent but 5% map to a more nested parent more than one step up the hierarchy. The table below shows child taxon level as rows and parent taxon level as columns.
kingdom phylum class order family genus species subspecies variety form
kingdom 0 0 0 0 0 0 0 0 0 0
phylum 100 0 0 0 0 0 0 0 0 0
class 5 316 0 0 0 0 0 0 0 0
order 7 45 1327 0 0 0 0 0 0 0
family 2191 1339 4267 14423 0 0 0 0 0 0
genus 3427 4985 5584 6260 220735 0 0 0 0 0
species 1567 706 1529 696 8944 2449414 0 0 0 0
subspecies 41 7 3 2 832 268 200902 0 0 0
variety 53 10 0 26 2661 50 82914 32 0 0
form 12 4 0 4 815 18 19272 0 56 0
  • Only 77% of unaccepted taxa map to a parent at the next most nested taxonomic level and 4.5% map to a parent at the same or a less nested level, as in the example above.
kingdom phylum class order family genus species subspecies variety form
kingdom 0 0 0 0 0 0 0 0 0 0
phylum 22 8 0 0 0 0 0 0 0 0
class 0 14 1 0 0 0 0 0 0 0
order 0 5 32 0 0 0 0 0 0 0
family 21 157 481 3599 0 0 0 0 0 0
genus 8555 24242 25055 31010 185911 0 0 0 0 0
species 64 24 173 405 2142 1886329 121225 84 5 0
subspecies 3 0 1 0 151 77512 26266 13 0 0
variety 2 1 0 2 367 212954 50062 47 4 0
form 0 0 0 0 128 48126 10449 3 2 0

GBIF validation process

Validation against the local GBIF database works by using an initial SQL query using the provided name and rank:

SELECT * FROM backbone
    WHERE canonicalName = 'XX'
    AND taxonRank = 'YY';

This will return all exact matches for the combination, which can include rows for multiple taxa with different description dates and authors, references and taxonomic statuses. The safedata_validator package examines the returned rows and tries to identify a single accepted, doubtful or non-canonical usage in that order of preference.

The backbone table row for the identified taxon provides both acceptedNameUsageID and parentNameUsageID fields that can be easily used in subsequent searches to get the accepted name and hook the taxon into the higher taxonomy. Although canonical higher taxon names are provided, their taxon IDs are not, so higher taxa are saved into a set containing unique pairs of names and ranks and then added to the index when all taxa have been processed.

Problems

Rare edge cases include taxon names with two equally approved usages: for example, the genus Morus is an accepted usage for both mulberries and gannets. This kind of problem is described in the JSON "note" field and provided to the user. In these rare cases, an accepted usage would require a GBIF taxon ID to be provided to discriminate between them.