Taxonomic validation process
This document provides an overview of the validation process used for GBIF taxa.
GBIF validation
GBIF validation searches the GBIF backbone taxonomy database for a taxon that matches both the taxon scientific name and taxonomic rank provided in a dataset. If a match is found then the GBIF database will supply one of the following status codes:
- accepted
- doubtful
- homotypic synonym
- synonym
- heterotypic synonym
- proparte synonym
- misapplied
The first two options (accepted and doubtful) are treated as canonical matches, but the remaining status codes will provide a link to an accepted taxonomic usage. The validation process therefore extracts three things from the backbone database:
- The status of the name and rank provided by the user.
- The accepted usage (which might be the same).
- The backbone taxonomic hierarchy for the taxon, so that it can be indexed at higher taxonomic levels.
However, this is complicated because there is no guarantee that a taxon will hook into the backbone at the next higher taxonomic level. For example, the accepted taxon Wanosuchus atresus is only hooked in at order level: it is accepted as a species, but its parent is Crocodylia because the genus is doubtful. Similarly, Goniopholis tenuidens is a synonym at species level but again has Crocodylia as a parent (and is considered a synonym for the family Goniopholidae).
In some cases, taxa can hook in at taxonomic levels more nested than their own: the species Brittonastrum greenei is a synonym of the subspecies Agastache pallidiflora pallidiflora.
The parent_taxon_level_analysis.R file in this repository contains code that
compares the taxonomic level of each taxon with that of its parent:
- All accepted taxa map to a parent at a higher taxonomic level, but 5% map to a parent more than one step up the hierarchy. The table below shows child taxon level as rows and parent taxon level as columns.
| | kingdom | phylum | class | order | family | genus | species | subspecies | variety | form |
|---|---|---|---|---|---|---|---|---|---|---|
| kingdom | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| phylum | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| class | 5 | 316 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| order | 7 | 45 | 1327 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| family | 2191 | 1339 | 4267 | 14423 | 0 | 0 | 0 | 0 | 0 | 0 |
| genus | 3427 | 4985 | 5584 | 6260 | 220735 | 0 | 0 | 0 | 0 | 0 |
| species | 1567 | 706 | 1529 | 696 | 8944 | 2449414 | 0 | 0 | 0 | 0 |
| subspecies | 41 | 7 | 3 | 2 | 832 | 268 | 200902 | 0 | 0 | 0 |
| variety | 53 | 10 | 0 | 26 | 2661 | 50 | 82914 | 32 | 0 | 0 |
| form | 12 | 4 | 0 | 4 | 815 | 18 | 19272 | 0 | 56 | 0 |
- Only 77% of unaccepted taxa map to a parent at the next higher taxonomic level, and 4.5% map to a parent at the same or a lower level, as in the Brittonastrum greenei example above. Again, rows are child taxon levels and columns are parent taxon levels.
| | kingdom | phylum | class | order | family | genus | species | subspecies | variety | form |
|---|---|---|---|---|---|---|---|---|---|---|
| kingdom | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| phylum | 22 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| class | 0 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| order | 0 | 5 | 32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| family | 21 | 157 | 481 | 3599 | 0 | 0 | 0 | 0 | 0 | 0 |
| genus | 8555 | 24242 | 25055 | 31010 | 185911 | 0 | 0 | 0 | 0 | 0 |
| species | 64 | 24 | 173 | 405 | 2142 | 1886329 | 121225 | 84 | 5 | 0 |
| subspecies | 3 | 0 | 1 | 0 | 151 | 77512 | 26266 | 13 | 0 | 0 |
| variety | 2 | 1 | 0 | 2 | 367 | 212954 | 50062 | 47 | 4 | 0 |
| form | 0 | 0 | 0 | 0 | 128 | 48126 | 10449 | 3 | 2 | 0 |
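Cross-tabulations like the two above can be approximated with a self-join on the backbone table. The following Python sketch is not the contents of parent_taxon_level_analysis.R; it is a minimal illustration that assumes the local database is SQLite, that the table has `taxonID` and `taxonomicStatus` columns (neither name is confirmed by this document), and that every status other than `accepted` counts as unaccepted.

```python
import sqlite3


def rank_crosstab(conn, accepted=True):
    """Count (child rank, parent rank) pairs in the backbone table.

    Assumed column names: taxonID, parentNameUsageID, taxonRank,
    taxonomicStatus. Taxa without a matching parent row are skipped.
    """
    # The operator comes from a fixed pair, so the f-string is safe here.
    op = "=" if accepted else "!="
    sql = f"""
        SELECT child.taxonRank, parent.taxonRank, COUNT(*)
        FROM backbone AS child
        JOIN backbone AS parent
          ON child.parentNameUsageID = parent.taxonID
        WHERE child.taxonomicStatus {op} 'accepted'
        GROUP BY child.taxonRank, parent.taxonRank
    """
    return {(child, parent): n for child, parent, n in conn.execute(sql)}
```

Calling `rank_crosstab(conn)` gives the counts for accepted taxa and `rank_crosstab(conn, accepted=False)` gives the counts for all other statuses, keyed by `(child rank, parent rank)`.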
GBIF validation process
Validation against the local GBIF database starts with an SQL query on the provided name and rank:
SELECT * FROM backbone
WHERE canonicalName = 'XX'
AND taxonRank = 'YY';
This will return all exact matches for the combination, which can include rows for
multiple taxa with different description dates and authors, references and taxonomic
statuses. The safedata_validator package examines the returned rows and tries to
identify a single accepted, doubtful or non-canonical usage in that order of
preference.
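The query and the order of preference described above could be sketched in Python as follows. This is an illustration only, not the safedata_validator implementation: the function name `find_usage` is hypothetical, the database is assumed to be SQLite, and the `taxonomicStatus` column name and its lowercase values are assumptions.

```python
import sqlite3


def find_usage(conn, name, rank):
    """Return the preferred backbone row for a name and rank, or None.

    Rows are returned as dicts. The taxonomicStatus column name is an
    assumption; canonicalName and taxonRank come from the query above.
    """
    cur = conn.execute(
        "SELECT * FROM backbone WHERE canonicalName = ? AND taxonRank = ?",
        (name, rank),  # bound parameters avoid quoting problems in names
    )
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, values)) for values in cur]

    # Lower value is preferred; any non-canonical status ranks last.
    preference = {"accepted": 0, "doubtful": 1}
    # On ties, min() keeps the first row; real code would need to handle
    # several equally preferred usages explicitly.
    return min(
        rows,
        key=lambda row: preference.get(row["taxonomicStatus"], 2),
        default=None,
    )
```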
The backbone table row for the identified taxon provides both acceptedNameUsageID and
parentNameUsageID fields that can be easily used in subsequent searches to get the
accepted name and hook the taxon into the higher taxonomy. Although canonical higher
taxon names are provided, their taxon IDs are not, so higher taxa are saved into a set
containing unique pairs of names and ranks and then added to the index when all taxa
have been processed.
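As a rough sketch of these follow-up lookups, the Python below resolves the accepted usage and parent of a matched row and records its named higher taxa as (name, rank) pairs. The helper names `fetch_by_id` and `resolve` are hypothetical, and the `taxonID`, `taxonomicStatus` and higher-rank column names (`kingdom` through `genus`) are assumptions about the table layout rather than details confirmed here.

```python
import sqlite3

# Assumed names of the columns holding canonical higher taxon names.
HIGHER_RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]


def fetch_by_id(conn, taxon_id):
    """Return the backbone row with this taxon ID as a dict, or None."""
    cur = conn.execute("SELECT * FROM backbone WHERE taxonID = ?", (taxon_id,))
    values = cur.fetchone()
    if values is None:
        return None
    return dict(zip([d[0] for d in cur.description], values))


def resolve(conn, row, higher_taxa):
    """Return (accepted row, parent row) and record named higher taxa.

    higher_taxa is a set of (name, rank) pairs shared across all taxa.
    """
    accepted = row
    if row["taxonomicStatus"] not in ("accepted", "doubtful"):
        # Follow the link to the accepted usage, falling back to the
        # original row if the link is missing or does not resolve.
        accepted = fetch_by_id(conn, row["acceptedNameUsageID"]) or row
    parent = fetch_by_id(conn, accepted["parentNameUsageID"])
    for rank in HIGHER_RANKS:
        # Skip empty columns and the taxon's own rank.
        if accepted.get(rank) and rank != accepted["taxonRank"]:
            higher_taxa.add((accepted[rank], rank))
    return accepted, parent
```

Because the set holds unique pairs, a higher taxon shared by many rows is only stored once before being added to the index.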
Problems
Rare edge cases include taxon names with two equally approved usages: for
example, the genus Morus is an accepted usage for both mulberries and gannets.
This kind of problem is described in the JSON "note" field and provided to the
user. In these rare cases, the user must provide a GBIF taxon ID to identify
which of the accepted usages is intended.
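A minimal Python sketch of how such a case might be detected and resolved is shown below. The function name `accepted_candidates` is hypothetical, the rows are assumed to be dicts with `taxonID` and `taxonomicStatus` keys, and the taxon IDs in the usage example are made up for illustration rather than real GBIF IDs.

```python
def accepted_candidates(rows, taxon_id=None):
    """Return the accepted rows, narrowed to one taxon ID when given.

    More than one result means the name and rank alone are ambiguous.
    """
    hits = [row for row in rows if row["taxonomicStatus"] == "accepted"]
    if taxon_id is not None:
        hits = [row for row in hits if row["taxonID"] == taxon_id]
    return hits


# Usage with made-up IDs: two accepted usages of the genus Morus.
morus = [
    {"taxonID": 101, "taxonomicStatus": "accepted", "family": "Moraceae"},
    {"taxonID": 202, "taxonomicStatus": "accepted", "family": "Sulidae"},
]
ambiguous = len(accepted_candidates(morus)) > 1
gannets = accepted_candidates(morus, taxon_id=202)
```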