Local NCBI taxonomy database
The data source
The NCBI maintain a taxonomy database, providing a taxonomic hierarchy for the sequences
stored in the NCBI system. The database is continually updated, but they provide an
archive
folder containing roughly monthly snapshots of the taxonomy database. We
recommend using these archive snapshots, because they provide a timestamped and
reproducible version of the underlying data. The available versions can be seen here:
To use the NCBI taxonomy as a local data resource for taxon validation, the
safedata_validator
package requires a version of the database to be built into a
SQLite3 database.
Building a local copy
The safedata_build_local_ncbi
command line tool is used to automatically download and
build the required file. The command line help is shown below - note that a particular
version can be selected by supplying the version date timestamp from the page above.
cl_prompt $ safedata_build_local_ncbi -h
usage: safedata_build_local_ncbi [-h] [-t TIMESTAMP] outfile
Build a local NCBI database.
This tool builds an SQLite database of the NCBI taxonomy to use in
validation by safedata_validate. There are multiple archived versions
of the dataset, and the available versions can be seen here:
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/
The tool will optionally take a timestamp - using the format '2021-11-26'
- to build a particular version, but defaults to the most recent version.
positional arguments:
outfile Filename to use for database file.
options:
-h, --help show this help message and exit
-t TIMESTAMP, --timestamp TIMESTAMP
The time stamp of a database archive version to use.
You will need to provide an output directory for the database and then use the command:
safedata_build_local_ncbi outdir
This should result in the following output:
- Downloading NCBI data to: /path/to/tempdir
- Connecting to FTP server
- Using most recent archive: 2023-11-01
- Downloading taxonomy to: /path/to/tempdir/taxdmp_2023-11-01.zip
100%|███████████████████████████████████████████| 60.5M/60.5M [00:03<00:00, 16.9MB/s]
- Building GBIF backbone database in: /path/to/outdir/ncbi_taxonomy_2023-11-01.sqlite
- Timestamp table created
- Creating nodes table
- Populating nodes table from nodes.dmp
2535034it [00:27, 93695.95it/s]
- Creating names table
- Populating names table from names.dmp
3914203it [00:31, 123281.67it/s]
- Creating merge table
- Populating merge table from merged.dmp
74501it [00:00, 155486.74it/s]
- Creating unique ranks table
- Creating database indexes
- Removing downloaded archive
Once you have an SQLite3 backbone database, you will then need to edit the
gbif_database
entry in your configuration file to provide
the path to your new SQLite file.
Build process overview
- The command line tool downloads the
taxdmp
zipfile archive for the requested timestamp. - This file contains data from a number of tables stored as DMP files, but only the
nodes.dmp
,names.dmp
, andmerged.dmp
are required. These describe the taxonomic nodes, the names assigned to specific nodes, and superseded nodes that have now been merged into other nodes, respectively. - The tool automatically extracts this data from within the zip archive.
- The nodes table contains a lot of un-needed fields - we therefore drop all fields
other than
tax_id
,parent_tax_id
,rank
, andcomments
. - The names table includes a field (
unique_name
) that gives a unique version of every name. Given thatsafedata_validator
has built in handling of ambiguous names and that these unique names are extremely specific (e.g.Bacteria <bacteria>
), we do not believe this is a useful field for our purpose. It is therefore dropped. - A single database is then generated with three tables:
nodes
,names
,merged
. - A
timestamp
table is created to hold the version timestamp. - A
unique_ncbi_ranks
table is created to hold the names of all ranks used in the NCBI database. -
The speed of the package is much improved by building covering indices to speed up the four kinds of searches used by
safedata_validator
: -
searches for specific taxon ids in the nodes table,
- searches for taxon names in the names table,
- searches for scientific names in the names tables and
- searches for specific taxon ids in the merged table.