# Local GBIF backbone database

## The data source
GBIF maintains a detailed database that is used to provide the hierarchical backbone
taxonomy underpinning GBIF species observations. Multiple versions of the dataset are
released, at roughly six-month to one-year intervals, each identified with a date
timestamp. All versions are freely available from GBIF at the link below, where the
`current` folder is a shortcut to the most recent version.

<https://hosted-datasets.gbif.org/datasets/backbone/current/>
To use the GBIF backbone taxonomy as a local data resource for taxon validation, the
`safedata_validator` package requires a version of the database to be built into an
SQLite3 database.
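For illustration, a local backbone database of this kind can be queried directly using
Python's standard `sqlite3` module. This is a sketch only, not part of the package: the
`backbone` table name matches the build process described below, but the exact column
names (`id`, `canonical_name`, `rank`, `status`) are assumptions based on the GBIF
simple backbone dump, and the real schema is more detailed.

```python
import sqlite3


def find_taxon(db_path, canonical_name, rank):
    """Return backbone rows matching a canonical name and rank.

    Assumes a 'backbone' table with id, canonical_name, rank and
    status columns - check your built database for the real schema.
    """
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "SELECT id, canonical_name, rank, status FROM backbone "
            "WHERE canonical_name = ? AND rank = ?",
            (canonical_name, rank),
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

This name-plus-rank lookup is one of the two search patterns that the database indexes
are designed to support.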
## Building the local GBIF database

The `safedata_build_local_gbif` command line tool is used to automatically download the
data and build the required file. The command line help is shown below. Note that a
particular version can be selected by supplying the version date timestamp from the
page above.
```sh
cl_prompt $ safedata_build_local_gbif -h
usage: safedata_build_local_gbif [-h] [-t TIMESTAMP] outfile

Build a local GBIF database.

This tool builds an SQLite database of the GBIF backbone taxonomy to use
in validation by safedata_validate. There are multiple versions of the
dataset, and the available versions can be seen here:

    https://hosted-datasets.gbif.org/datasets/backbone/

The tool will optionally take a timestamp - using the format '2021-11-26'
- to build a particular version, but defaults to the most recent version.

positional arguments:
  outfile               Filename to use for database file.

options:
  -h, --help            show this help message and exit
  -t TIMESTAMP, --timestamp TIMESTAMP
                        The time stamp of a database archive version to use.
```
You will need to provide an output directory for the database and then use the command:

```sh
safedata_build_local_gbif outdir
```
This should result in the following output:

```text
- Downloading GBIF data to: /path/to/tempdir
- Checking for version with timestamp 2023-08-28
- Downloading simple.txt.gz
100%|████████████████████████████████████████████| 466M/466M [00:05<00:00, 92.2MB/s]
- Downloading simple-deleted.txt.gz
100%|████████████████████████████████████████████| 90.8M/90.8M [00:00<00:00, 96.2MB/s]
- Building GBIF backbone database in: /path/to/outdir/gbif_backbone_2023-08-28.sqlite
- Timestamp table created
- Backbone table created
- Adding core backbone taxa
7746724it [03:31, 36688.61it/s]
- Adding deleted taxa
1711901it [00:42, 40470.91it/s]
- Creating database indexes
- Removing downloaded files
```
Once you have an SQLite3 backbone database, you will then need to edit the
`gbif_database` entry in your configuration file to provide the path to your new
SQLite file.
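As an illustration only, and assuming an INI-style configuration file (check the
`safedata_validator` configuration documentation for the exact format and location of
your configuration file), the entry might look something like:

```ini
gbif_database = /path/to/outdir/gbif_backbone_2023-08-28.sqlite
```

The file name shown here is just the example from the build output above; use the path
to the database file you actually built.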
## Build process overview
From the archive directory for the version, the database is built from two files:
`simple.txt.gz` and `simple-deleted.txt.gz`. These files contain the data for a
simplified version of the GBIF backbone, including taxa that have been deleted from the
GBIF backbone. Both files are dumps from a PostgreSQL database, and the definition
(schema) for the resulting table can be found here.
There are a number of steps needed to convert these data into an SQLite3 database, but
the basic process is:
- The `simple` file contains some very long fields (notably `name_published_in`) that
  include a lot of quotes and add to the file size, but are not used by the package.
  This field is therefore dropped.
- Both files contain a lot of `\N` values, which is the PostgreSQL symbol for a null
  (empty) field. SQLite3 would treat these as values and so they need to be converted
  to a `null` value.
- The main `backbone` table is created and then data from both files are inserted as
  the data rows.
- A `timestamp` table is inserted to record the timestamp of the database version used
  to build the local file.
- The speed of the package is much improved by building covering indices to speed up
  the two kinds of searches used by `safedata_validator`:
    - searches on the canonical name and rank of a taxon, and
    - searches on the taxon id.
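The steps above can be sketched in Python. This is a simplified illustration under
stated assumptions, not the `safedata_build_local_gbif` implementation: the toy table
here has only three columns (`id`, `canonical_name`, `rank`) whereas the real backbone
schema has many more, and the dropping of the `name_published_in` field and the
`timestamp` table are omitted for brevity.

```python
import gzip
import sqlite3


def load_backbone(db_path, simple_path):
    """Build a toy backbone table from a gzipped, tab-delimited dump."""
    conn = sqlite3.connect(db_path)
    # Create the main backbone table (the real schema has many more columns).
    conn.execute(
        "CREATE TABLE backbone (id INTEGER PRIMARY KEY, "
        "canonical_name TEXT, rank TEXT)"
    )
    with gzip.open(simple_path, "rt", encoding="utf-8") as src:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # PostgreSQL dumps mark empty fields with '\N':
            # convert these to None so SQLite stores a real NULL.
            fields = [None if value == "\\N" else value for value in fields]
            conn.execute("INSERT INTO backbone VALUES (?, ?, ?)", fields[:3])
    # Covering index to speed up searches on canonical name and rank.
    conn.execute("CREATE INDEX name_rank_idx ON backbone (canonical_name, rank)")
    conn.commit()
    return conn
```

The covering index means that name and rank lookups can be answered from the index
alone, without touching the main table rows, which is where most of the speed
improvement comes from.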