
Using the SAFE Dataset Checker

The safedata_validator package contains Python code to validate data files that use the safedata format and to report on any problems. The code validates:

  1. The data submission formatting of the file.
  2. All taxonomic names against the GBIF taxonomy database.
  3. All location names against a locations gazetteer.

The package can be imported in Python for use within other frameworks. However, the package also provides a command line tool that allows it to be installed and run as a standalone application, for example by data managers or individual researchers.

Configuring data resources

The safedata_validator package requires external data resources to validate both dataset locations and taxa. You will need to create a configuration file to set safedata_validator up to find those resources.

Note that a key resource - the GBIF taxonomy database - requires a local SQLite3 database containing the core data from GBIF. This is a relatively large file (~2GB in total). The package provides a command (safedata_build_local_gbif) to download and build this database, and the path to the resulting file can then be included in the configuration.

Validating publication DOIs is optional, but it requires an internet connection.

GBIFTaxa

To validate taxonomic information against GBIF, you will need to download a copy of the GBIF backbone taxonomy and build a SQLite3 database from it. The package provides a template Python script to do this. If you are happy with running Python scripts, then it is not particularly difficult if you follow the instructions provided. The resulting database file is around 1.6GB, so you'll need file space!
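
Once the database file has been built, it can be worth confirming that the file opens as a SQLite3 database before adding its path to the configuration. The following is a minimal sketch using only the Python standard library. The function name list_sqlite_tables is hypothetical and is not part of safedata_validator, and the sketch makes no assumption about which tables the built database contains: it only lists whatever is there.

```python
import sqlite3
from pathlib import Path


def list_sqlite_tables(db_path):
    """Return the table names in an SQLite3 file, opened read-only.

    Hypothetical helper for illustration; not part of safedata_validator.
    """
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"No database file at {path}")

    # Opening with mode=ro stops sqlite3 from silently creating an
    # empty database if the path is wrong.
    uri = f"{path.resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()

    return [name for (name,) in rows]
```

Calling list_sqlite_tables on the path of the built file should return a non-empty list. An empty list, or an sqlite3.DatabaseError, suggests that the build did not complete.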

Using safedata_validate

Once you have set up and configured safedata_validator, the usage instructions are shown below:

cl_prompt $ safedata_validate -h
usage: safedata_validate [-h] [-r RESOURCES] [-s] [--validate_doi]
                         [--chunk_size CHUNK_SIZE] [-l LOG] [-j JSON]
                         [--version]
                         [filename]

Validate a dataset using a command line interface.

    This program validates an Excel file formatted as a `safedata` dataset.
    As it runs, it outputs a report that highlights any problems with the
    formatting. Much of the validation is to check that the data meets our
    metadata standards and is internally consistent.

    However, the package uses external resources to perform validation of
    taxa and sampling locations and to provide other information. For
    this reason, using this program requires you to provide a configuration
    file for these resources or to have installed a configuration file in a
    standard location. If you run `safedata_validate` without a
    configuration file, the output will report the standard locations for
    your operating system.

    If validation is successful, then a JSON format file containing key
    metadata will be saved. This is used in the dataset publication process.
    By default, the JSON file is saved to the same directory as the input
    file, using the same filename but with the `.json` extension. This can
    be saved elsewhere using the `--json` option.

    The command also outputs a log of the validation process, which
    identifies validation issues. This defaults to being written to
    stderr but can be redirected to a file using the `--log` option.

positional arguments:
  filename      Path to the Excel file to be validated

options:
  -h, --help    show this help message and exit
  -r RESOURCES, --resources RESOURCES
                A path to a resources configuration file
  -s, --show-resources
                Validate and display the selected resources and exit
  --validate_doi
                Check the validity of any publication DOIs, provided by the
                user. Requires a web connection.
  --chunk_size CHUNK_SIZE
                Data are loaded from worksheets in chunks: the number of rows
                in a chunk is set by this argument
  -l LOG, --log LOG
                Save the validation log to a file, not print to the console.
  -j JSON, --json JSON
                An optional output path for the validated dataset JSON.
  --version     show program's version number and exit

In the simplest case, you should now be able to run:

safedata_validate MyDataset.xlsx

The program will then validate the input dataset, printing information about the validation process and any errors in the dataset as it goes. When a dataset passes validation, a JSON file containing the metadata for the validated dataset will be created.
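
If you want to run the command line tool from a Python script, for example to validate several files in turn, one option is to call it with the standard library subprocess module. The sketch below uses only the options listed in the help text above (--resources, --json and --log). The helper names build_command, default_json_path and validate are hypothetical and are not part of the package. The sketch also assumes that the default JSON output replaces the input file extension with .json, and it does not rely on any particular meaning for the exit code, which the help text does not document.

```python
import json
import subprocess
from pathlib import Path


def build_command(dataset, resources=None, json_out=None, log=None):
    """Build the argument list for safedata_validate (hypothetical helper)."""
    cmd = ["safedata_validate"]
    if resources is not None:
        cmd += ["--resources", str(resources)]
    if json_out is not None:
        cmd += ["--json", str(json_out)]
    if log is not None:
        cmd += ["--log", str(log)]
    cmd.append(str(dataset))
    return cmd


def default_json_path(dataset):
    """Assumed default output: same directory and name, .json extension."""
    return Path(dataset).with_suffix(".json")


def validate(dataset, resources=None, json_out=None, log=None):
    """Run the validator and return (exit code, metadata dict or None)."""
    cmd = build_command(dataset, resources=resources, json_out=json_out, log=log)
    result = subprocess.run(cmd, check=False)

    # The JSON file is only written when validation succeeds.
    json_path = Path(json_out) if json_out is not None else default_json_path(dataset)
    if not json_path.is_file():
        return result.returncode, None
    with open(json_path, encoding="utf-8") as handle:
        return result.returncode, json.load(handle)
```

For example, validate("MyDataset.xlsx", log="MyDataset.log") runs the same validation as the command above but writes the log to a file. Note that a JSON file left over from an earlier successful run would also be picked up by this sketch, so remove stale output first if that matters.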