# Using the SAFE Dataset Checker
The `safedata_validator` package contains Python code to validate files containing data using `safedata` formatting and to report on any problems. The code validates:
- The data submission formatting of the file.
- All taxonomic names against either the GBIF or NCBI taxonomy databases.
- All location names against a locations gazetteer.
The package can be imported in Python for use within other frameworks. However, it also provides a command line tool that allows it to be installed and run as a standalone application, for example by data managers or individual researchers.
## Configuring data resources
The `safedata_validator` package requires external data resources to validate both dataset locations and taxa. You will need to create a configuration file so that `safedata_validator` can find those resources.
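For concreteness, a configuration file might look something like the sketch below. This is purely illustrative: the key names, paths, and file format shown here are assumptions, not the package's documented schema, so check the `safedata_validator` documentation for the actual layout.

```ini
; Illustrative sketch only: the key names and paths below are assumptions,
; not the documented safedata_validator configuration schema.
gbif_database = /data/safedata/gbif_backbone.sqlite3
ncbi_database = /data/safedata/ncbi_taxonomy.sqlite3
locations = /data/safedata/gazetteer.geojson
```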
Note that two key resources - the GBIF and NCBI taxonomy databases - require local SQLite3 databases containing the core data from those databases. These are relatively large files (~2 GB in total). The package provides two commands (`safedata_build_local_gbif` and `safedata_build_local_ncbi`) to download and build these databases, and the paths to those files can then be included in the configuration.
Note that DOI validation requires an internet connection, but this check is optional.
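DOI checking needs the network because a DOI can only be confirmed by resolving it against the registration agency; offline, the most you can do is check that a string is shaped like a DOI. Below is a minimal sketch of such an offline syntax check (the regular expression is an illustrative approximation, not the rule that `safedata_validate` actually applies):

```python
import re

# Illustrative approximation of DOI syntax: "10.<registrant>/<suffix>".
# This only tests the shape of the string; confirming that a DOI really
# exists requires resolving it online.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(candidate: str) -> bool:
    """Offline syntax check only - not a substitute for online validation."""
    return bool(DOI_PATTERN.match(candidate))

print(looks_like_doi("10.1038/nature12373"))  # True
print(looks_like_doi("not-a-doi"))            # False
```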
### GBIFTaxa
To validate taxonomic information against GBIF, you will need to download a copy of the GBIF backbone taxonomy and build a SQLite3 database from it. The package provides a template Python script to do this. If you are comfortable running Python scripts, it is not particularly hard and is described in detail here. The resulting database file is around 1.6 GB, so you will need sufficient file space!
### NCBITaxa
Similarly, taxon validation against NCBI requires you to download a snapshot of the NCBI database and build a SQLite3 database from it. Using a local database is substantially faster than using the online NCBI Entrez tools, which have built-in rate limiting. Instructions on how to construct the local database are given here. Again, the resulting database is large (~600 MB), so you will need to ensure you have sufficient file space!
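To illustrate why the local database is faster: a taxon lookup becomes a simple indexed SQLite query with no network round trip or rate limit. The schema below is invented for illustration and does not match the real GBIF or NCBI database files built by the package:

```python
import sqlite3

# Illustrative sketch only: the real database schemas used by
# safedata_validator differ. This just shows a local, in-process
# taxon lookup with no network round trip.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxa (taxon_id INTEGER PRIMARY KEY, name TEXT, rank TEXT)"
)
conn.executemany(
    "INSERT INTO taxa VALUES (?, ?, ?)",
    [(9606, "Homo sapiens", "species"), (9605, "Homo", "genus")],
)

# Look up a single taxon by name.
row = conn.execute(
    "SELECT taxon_id, rank FROM taxa WHERE name = ?", ("Homo sapiens",)
).fetchone()
print(row)  # (9606, 'species')
```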
## Using `safedata_validate`
Once you have set up and configured `safedata_validator`, the usage instructions are shown below:
```
$ safedata_validate -h
usage: safedata_validate [-h] [-r RESOURCES] [-s] [--validate_doi]
                         [--chunk_size CHUNK_SIZE] [-l LOG] [-j JSON]
                         [--version]
                         [filename]

Validate a dataset using a command line interface.

This program validates an Excel file formatted as a `safedata` dataset.
As it runs, it outputs a report that highlights any problems with the
formatting. Much of the validation is to check that the data meets our
metadata standards and is internally consistent.

However, the package uses external resources to perform validation of
taxa and sampling locations and to provide other information. For
this reason, using this program requires you to provide a configuration
file for these resources or to have installed a configuration file in a
standard location. If you run `safedata_validate` without a
configuration file, the output will report the standard locations for
your operating system.

If validation is successful, then a JSON format file containing key
metadata will be saved. This is used in the dataset publication process.
By default, the JSON file is saved to the same directory as the input
file, using the same filename but with the `.json` extension. This can
be saved elsewhere using the `--json` option.

The command also outputs a log of the validation process, which
identifies validation issues. This defaults to being written to
stderr but can be redirected to a file using the `--log` option.

positional arguments:
  filename              Path to the Excel file to be validated

options:
  -h, --help            show this help message and exit
  -r RESOURCES, --resources RESOURCES
                        A path to a resources configuration file
  -s, --show-resources  Validate and display the selected resources and exit
  --validate_doi        Check the validity of any publication DOIs
                        provided by the user. Requires a web connection.
  --chunk_size CHUNK_SIZE
                        Data are loaded from worksheets in chunks: the
                        number of rows in a chunk is set by this argument
  -l LOG, --log LOG     Save the validation log to a file, not print to
                        the console.
  -j JSON, --json JSON  An optional output path for the validated dataset
                        JSON.
  --version             show program's version number and exit
```
Essentially, you should now be able to do:

```sh
safedata_validate MyDataset.xlsx
```
The program will then validate the input dataset, printing information about the validation process and any errors in the dataset as it goes. When a dataset passes validation, a JSON file containing the metadata for the validated dataset will be created.
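As the help text notes, the default output path reuses the input filename with a `.json` extension. A small sketch of that rule, using a hypothetical helper that is not part of the package:

```python
from pathlib import Path

def default_json_path(dataset: str) -> Path:
    # Hypothetical helper: the validated metadata is written alongside the
    # input file, with the same stem and a .json extension, unless the
    # --json option overrides it.
    return Path(dataset).with_suffix(".json")

print(default_json_path("surveys/MyDataset.xlsx"))  # surveys/MyDataset.json
```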