Using the SAFE Dataset Checker

The safedata_validator package contains Python code to validate files containing data using SAFE data formatting and report on any problems. The code validates:

  1. The data submission formatting of the file.
  2. All taxonomic names against the GBIF taxonomy database.
  3. All location names against the SAFE Gazetteer.

This package is used to validate datasets submitted online to a SAFE data management website, such as the SAFE Project website. However, it can also be installed and run independently, for example by data managers or individual researchers.

The code is open source Python and is maintained on GitHub but can also be installed using PyPI. See the installation notes for setup instructions.

The package provides a command line program safedata_validate. The usage instructions are below but you will also need to provide links to some external data resources used in location and taxon validation.

usage: safedata_validator.py [-h] [-l LOCATIONS_JSON]
                               [--gbif_database GBIF_DATABASE]
                               [--validate_doi]
                               fname

This program validates an Excel file formatted as a SAFE dataset. As it runs,
it outputs a report that highlights any problems with the formatting. Much of
the validation is to check that the data meets our metadata standards and is
internally consistent. However, it uses external sources to perform validation
in three areas.

1. Taxon validation. The program validates taxonomic names against the GBIF
taxonomy backbone. By default, it uses the GBIF web API to validate names,
but can also use a local copy of the backbone provided in a sqlite database:
this will work offline and is much faster but requires some simple setup.

2. Location names. The program also validates sampling location names against
the SAFE gazetteer. By default, this is loaded automatically from the SAFE
website so requires an internet connection, but a local copy can be provided
for offline use.

3. DOI checking. Optionally, the program will validate any DOIs provided as
having used the database. This requires a web connection and cannot be
performed offline.

positional arguments:
  fname                 Path to the Excel file to be validated.

optional arguments:
  -h, --help            show this help message and exit
  -l LOCATIONS_JSON, --locations_json LOCATIONS_JSON
                        Path to a locally stored json file of valid location
                        names
  -g GBIF_DATABASE, --gbif_database GBIF_DATABASE
                        The path to a local sqlite database containing the
                        GBIF taxonomy backbone.
  --validate_doi        Check the validity of any publication DOIs, provided
                        by the user. Requires a web connection.

Data resources

The safedata_validator package requires external data resources to validate both dataset locations and taxa. The package supports online resources for both, which is the easiest way for users to get up and running. The online GBIF Search API is used by default for taxa, but for locations you need to point the package at a web service providing valid location data.

For example, the SAFE Project website provides an API endpoint returning location data. Using this API and the default online GBIF validation, the following command will validate MyData.xlsx:

safedata_validate MyData.xlsx -l https://www.safeproject.net/api/validator_locations

This is the easiest option for most users, but it can be rather slow and requires an internet connection. If you want to improve the speed of safedata_validator for frequent use, or need to be able to use it offline, you can provide local copies of the data resources. Note that DOI validation, which is optional, always requires an internet connection.

The locations of these resources are set by the command line arguments shown above, but they can also be set in a configuration file for repeated use.

Locations

Locations are validated against a set of known location names and possible aliases for those names. The data resource providing this information is set with the locations argument (-l or --locations_json). This can either be a link to a web service, as shown above, or a static local JSON file, which is faster and works offline:

safedata_validate MyData.xlsx -l /path/to/validator_locations.json
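
One way to obtain such a local file is to save the web service response once while online. The following is a minimal sketch only: it assumes that the JSON returned by the web service can be used unchanged as the local file, which is not confirmed here, and the helper name save_locations is hypothetical rather than part of the package.

```python
import json
import urllib.request
from pathlib import Path


def save_locations(url, dest):
    """Download a locations JSON document from `url` and save it to `dest`.

    Hypothetical helper, not part of safedata_validator. It assumes the
    web service response can be used unchanged as the local file for -l.
    """
    with urllib.request.urlopen(url) as response:
        payload = response.read().decode("utf-8")

    # Parse before saving so that a truncated or non-JSON response fails
    # here rather than later, during validation.
    json.loads(payload)

    dest = Path(dest)
    dest.write_text(payload, encoding="utf-8")
    return dest
```

Under those assumptions, calling save_locations("https://www.safeproject.net/api/validator_locations", "validator_locations.json") while online would create a file that can then be passed to -l.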

Taxa

If you want to speed up taxon checking and allow offline use, you will need to download a copy of the GBIF backbone taxonomy and build a SQLite3 database from it. Building the database isn't particularly hard and is described in detail here, but the resulting file is around 1.6 GB, so you'll need file space! Once built, a local database is much faster than using the GBIF API online.

Once you have this file, you can use it like this:

safedata_validate MyData.xlsx -g /path/to/gbif_backbone.sqlite

Fully offline use

If you've done both the above steps then the following example would validate a file using both local data resources, and won't need the internet at all.

safedata_validate MyData.xlsx -g /path/to/gbif_backbone.sqlite \
    -l /path/to/validator_locations.json
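
If you need to validate many files, the same command can be assembled and run from Python. This is a sketch rather than an interface provided by the package: build_command and validate_files are hypothetical names, and only the command line options documented above are used. What exit code the program returns when validation fails is not stated here, so treat the collected codes as informational.

```python
import subprocess


def build_command(fname, gbif_database=None, locations=None, validate_doi=False):
    """Build the safedata_validate argument list for one file.

    Hypothetical helper: it only assembles the documented options.
    """
    cmd = ["safedata_validate", str(fname)]
    if gbif_database is not None:
        cmd += ["-g", str(gbif_database)]
    if locations is not None:
        cmd += ["-l", str(locations)]
    if validate_doi:
        cmd.append("--validate_doi")
    return cmd


def validate_files(fnames, **resources):
    """Run the validator on each file in turn and collect the exit codes."""
    results = {}
    for fname in fnames:
        completed = subprocess.run(build_command(fname, **resources))
        results[fname] = completed.returncode
    return results
```

For example, validate_files(["MyData.xlsx"], gbif_database="/path/to/gbif_backbone.sqlite", locations="/path/to/validator_locations.json") runs the fully offline command shown above.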

Configuration file

You can also avoid having to specify location and taxa data resources every time you use safedata_validate by storing their locations in a configuration file. This is simply a JSON file that contains the resource locations.

For example, this file would set up safedata_validate to run online:

{
        "locations": "https://www.safeproject.net/api/validator_locations"
}

The next configuration sets up fully offline use:

{
        "locations": "/path/to/validator_locations.json",
        "gbif_database": "/path/to/gbif_backbone.sqlite"
}

In both cases, validation can now simply use:

safedata_validate MyData.xlsx

If you do provide command line arguments, they will override anything set in the configuration.
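
The precedence rule can be illustrated with a short sketch. This is not the package's own code: resolve_resources is a hypothetical function, and it assumes that an option which was not given on the command line is represented as None.

```python
import json


def resolve_resources(config_text, cli_args):
    """Merge configuration file values with command line values.

    `config_text` is the JSON text of a configuration file. `cli_args`
    maps option names to values, with None meaning "not given".
    """
    resolved = json.loads(config_text)
    for key, value in cli_args.items():
        if value is not None:
            resolved[key] = value  # a command line value wins
    return resolved


config = """{
    "locations": "/path/to/validator_locations.json",
    "gbif_database": "/path/to/gbif_backbone.sqlite"
}"""

# Only the locations resource is overridden on the command line.
resolved = resolve_resources(
    config,
    {
        "locations": "https://www.safeproject.net/api/validator_locations",
        "gbif_database": None,
    },
)
print(resolved["locations"])      # the web service given on the command line
print(resolved["gbif_database"])  # still the path from the configuration
```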

Configuration file location

To avoid having to provide a path to the configuration file, safedata_validator looks in specific locations for configuration files. Conventions for config file locations differ across operating systems and we use the conventions used by the appdirs package.

In addition, safedata_validator will look for both user and site configuration files. Site configurations allow a data manager to set up a specific machine with data resources for all users. If both are present, the user configuration is used. The configuration file must be called safedata_validator.json and the user and site config folders are:

On Mac OS X:

/Users/username/Library/Application Support/safedata_validator/
/Library/Application Support/safedata_validator/

On Windows (the repeated name is not an error):

C:\Users\username\AppData\Local\safedata_validator\safedata_validator

On Linux:

/home/username/.config/safedata_validator
/etc/xdg/safedata_validator
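
The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: find_config is a hypothetical function, and the two folders are passed in explicitly rather than being obtained from the appdirs package.

```python
from pathlib import Path

CONFIG_NAME = "safedata_validator.json"


def find_config(user_dir, site_dir):
    """Return the configuration file to use, or None if neither exists.

    The user configuration is checked first, so it is preferred when
    both a user and a site configuration are present.
    """
    for folder in (user_dir, site_dir):
        candidate = Path(folder) / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None
```

On Linux, for example, find_config("/home/username/.config/safedata_validator", "/etc/xdg/safedata_validator") would return the user file if it exists, the site file otherwise, and None if neither is present.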