Using the SAFE Dataset Checker

The safedata_validator package contains Python code to validate files of data that use safedata formatting and to report any problems. The code validates:

  1. The data submission formatting of the file.
  2. All taxonomic names against either the GBIF or NCBI taxonomy databases.
  3. All location names against a locations gazetteer.

The package can be imported in Python for use within other frameworks. However, it also provides a command line tool that allows it to be installed and run as a standalone application, for example by data managers or individual researchers.
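As a minimal sketch of programmatic use - note that the module, class and method names used here (Resources, Dataset, load_from_workbook) are assumptions about the package API, so check the safedata_validator API documentation for the current interface:

from safedata_validator.field import Dataset
from safedata_validator.resources import Resources

# NOTE: these names are assumptions about the package API - see the docs.
resources = Resources("safedata_validator.cfg")  # your configuration file
dataset = Dataset(resources)
dataset.load_from_workbook("MyDataset.xlsx")  # logs any validation issues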

Configuring data resources

The safedata_validator package requires external data resources to validate both dataset locations and taxa. You will need to create a configuration file to set safedata_validator up to find those resources.
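As an illustration, a configuration file is a plain text file mapping each resource to a path on your machine. The key names below are purely illustrative - the required names and file format are set out in the package documentation:

# Illustrative sketch only: the actual key names and layout are defined
# in the safedata_validator documentation.
gbif_database = /path/to/gbif_backbone.sqlite
ncbi_database = /path/to/ncbi_taxonomy.sqlite
locations = /path/to/gazetteer.json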

Note that two key resources - the GBIF and NCBI taxonomies - require local SQLite3 databases containing the core data from those sources. These are relatively large files (~2 GB in total). The package provides two commands (safedata_build_local_gbif and safedata_build_local_ncbi) to download and build these databases, and the paths to the resulting files can then be included in the configuration.
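Both build commands are installed alongside the package and, before committing to a large download, you can list their options:

safedata_build_local_gbif --help
safedata_build_local_ncbi --help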

Note that DOI validation requires an internet connection, but this step is optional.

GBIFTaxa

To validate taxonomic information against GBIF, you will need to download a copy of the GBIF backbone taxonomy and build a SQLite3 database from it. The package provides a template Python script to do this. If you are happy running Python scripts, it is not particularly hard and is described in detail here. The resulting database file is around 1.6 GB, so you will need sufficient file space!
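Once built, the file is a standard SQLite3 database that can be queried directly. The sketch below assumes a table named backbone with a canonical_name column - the actual schema is set by the build script, so inspect the database if the query fails:

import sqlite3

# ASSUMPTION: the table and column names depend on the build script output.
conn = sqlite3.connect("gbif_backbone.sqlite")  # placeholder path
row = conn.execute(
    "SELECT * FROM backbone WHERE canonical_name = ? LIMIT 1",
    ("Panthera tigris",),
).fetchone()
print(row)
conn.close()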

NCBITaxa

Similarly, taxon validation against NCBI requires you to download a snapshot of the NCBI taxonomy database and build a SQLite3 database from it. Using a local database is substantially faster than using the online NCBI Entrez tools, which have built-in rate limiting. Instructions on how to construct the local database are given here. Again, the resulting database is large (~600 MB), so you will need to ensure you have sufficient file space!
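Since both taxonomy builds are plain SQLite3 files, a quick sanity check from Python will confirm that each build completed and contains tables (the filenames here are placeholders for your own build paths):

import sqlite3

# Placeholder filenames - use the paths from your own database builds.
for path in ("gbif_backbone.sqlite", "ncbi_taxonomy.sqlite"):
    conn = sqlite3.connect(path)
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )]
    print(path, tables)
    conn.close()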

Using safedata_validate

Once you have set up and configured safedata_validator, the usage instructions are as follows:

$ safedata_validate -h
usage: safedata_validate [-h] [-r RESOURCES] [-s] [--validate_doi]
                         [--chunk_size CHUNK_SIZE] [-l LOG] [-j JSON]
                         [--version]
                         [filename]

Validate a dataset using a command line interface.

    This program validates an Excel file formatted as a `safedata` dataset.
    As it runs, it outputs a report that highlights any problems with the
    formatting. Much of the validation is to check that the data meets our
    metadata standards and is internally consistent.

    However, the package uses external resources to perform validation of
    taxa and sampling locations and to provide other information. For
    this reason, using this program requires you to provide a configuration
    file for these resources or to have installed a configuration file in a
    standard location. If you run `safedata_validate` without a
    configuration file, the output will report the standard locations for
    your operating system.

    If validation is successful, then a JSON format file containing key
    metadata will be saved. This is used in the dataset publication process.
    By default, the JSON file is saved to the same directory as the input
    file, using the same filename but with the `.json` extension. This can
    be saved elsewhere using the `--json` option.

    The command also outputs a log of the validation process, which
    identifies validation issues. This defaults to being written to
    stderr but can be redirected to a file using the `--log` option.

positional arguments:
  filename      Path to the Excel file to be validated

options:
  -h, --help    show this help message and exit
  -r RESOURCES, --resources RESOURCES
                A path to a resources configuration file
  -s, --show-resources
                Validate and display the selected resources and exit
  --validate_doi
                Check the validity of any publication DOIs, provided by the
                user. Requires a web connection.
  --chunk_size CHUNK_SIZE
                Data are loaded from worksheets in chunks: the number of rows
                in a chunk is set by this argument
  -l LOG, --log LOG
                Save the validation log to a file, not print to the console.
  -j JSON, --json JSON
                An optional output path for the validated dataset JSON.
  --version     show program's version number and exit
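Before validating a dataset, you can use the options above to check that your configuration file is found and that its resources load correctly:

safedata_validate -s
safedata_validate -r /path/to/your_config.cfg -s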

Essentially, you should now be able to do:

safedata_validate MyDataset.xlsx

The program will then validate the input dataset, printing information about the validation process and any errors in the dataset as it goes. When a dataset passes validation, a JSON file containing the metadata for the validated dataset will be created.
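The output is ordinary JSON, so it can be inspected with standard tools. As a minimal sketch, assuming the default output naming described in the help text above:

import json

# By default, the JSON output shares the input filename with a .json extension.
with open("MyDataset.json") as json_file:
    metadata = json.load(json_file)

print(sorted(metadata))  # list the top-level metadata fields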