Configuring the `safedata_validator` package

The safedata_validator package needs to be configured to use specific resources for data validation using an INI format configuration file.

Configuration file format

The file structure for the configuration file is a text file containing the details below:

gbif_database = /path/to/local/backbone.sqlite3
ncbi_database = /path/to/local/ncbi_database.sqlite3
gazetteer = /path/to/gazeteer.geojson
location_aliases = /path/to/location_aliases.csv
project_database = /path/to/project_database.csv

[extents]
temporal_soft_extent = 2002-02-02, 2030-01-31
temporal_hard_extent = 2002-02-01, 2030-02-01
latitudinal_hard_extent = -90, 90
latitudinal_soft_extent = -4, 6
longitudinal_hard_extent = -180, 180
longitudinal_soft_extent = 110, 120

[zenodo]
community_name = safe
contact_name = The SAFE Project
contact_affiliation = Imperial College London
contact_orcid = 0000-0003-3378-2814
use_sandbox = true
zenodo_api = https://api.zenodo.org
zenodo_token = abc
zenodo_sandbox_api = https://sandbox.zenodo.org
zenodo_sandbox_token = xyz

[metadata]
api = https://safeproject.net
token = xyz
ssl_verify = true

[xml]
languageCode=eng
characterSet=utf8
contactCountry=United Kingdom
contactEmail=admin@safeproject.net
epsgCode=4326
projectURL=https://safeproject.net
topicCategories=biota,environment,geoscientificInformation
lineageStatement="""This dataset was collected as part of a research project
based at The SAFE Project. For details of the project and data collection,
see the methods information contained within the datafile and the project
website: https://safeproject.net."""

Configuration file locations

You can put configuration files in any location and can even have multiple configurations that point to different resources, such as different versions of taxonomic databases. You can always use a specific configuration by providing a safedata_validator tool with the path to that configuration file, using the --resources option.

However, if you only use a single configuration, the safedata_validator tools can automatically load that configuration if it is saved to a specific location. Conventions for configuration file locations differ across operating systems and we use the conventions used by the appdirs package.

The safedata_validator package will look for both user and site configuration files. User configuration are only available for a particular user account, but site configurations allow a data manager to set up a specific machine with data resources for all users. If both are present, the user configuration is preferred.

For site and user configurations, the file must be called safedata_validator.cfg and the user and site locations are:

Mac OS XWindowsLinux

User: /Users/username/Library/Application Support/safedata_validator/safedata_validator.cfg
Site: /Library/Application Support/safedata_validator/safedata_validator.cfg

The repeated directory names are not an error!

User: C:\Users\username\AppData\Local\safedata_validator\safedata_validator\safedata_validator.cfg
Site: C:\\ProgramData\\safedata_validator\\safedata_validator\safedata_validator.cfg

User: /home/username/.config/safedata_validator/safedata_validator.cfg
Site: /etc/xdg/safedata_validator/safedata_validator.cfg

Configuration components

The configuration file content breaks down into three distinct parts of the safedata_validator workflow:

Validation: the core resource files, such as taxonomic databases and gazetteer, and the extents settings for a project, that are used to check a dataset is valid.
Publication: the account settings and access tokens used to publish validated datasets to Zenodo. This also includes the section of XML details, since we recommend including XML metadata with published datasets.
Metadata: the URL and access tokens to send metadata about a dataset to a metadata server, allowing datasets to be accessed using the safedata R package.

Info

You do not need to configure the publication and metadata sections if you are only using safedata_validator to validate datasets. You can also publish datasets to Zenodo without needing to set up and configure a metadata server.

Validation configuration

The validation configuration includes the following components, all of which must be provided to validate datasets.

The gbif_database element

This element provides the path to the local GBIF backbone database to be used in this configuration.

The ncbi_database element

This element provides the path to the local NCBI backbone database to be used in this configuration.

The gazetteer and location_aliases elements

These elements provide the paths to the location database for the project.

The project_database element

The project database element is an optional configuration setting that allows datasets to be grouped into projects. If you want to use projects then you will need to create a CSV file containing at least project_id and title fields, although you can add other fields if you want.

The project database can be updated to add new projects and change titles and other details but you must not change or delete existing Project IDs once they have been created - a given project ID must always refer to the same project.

Warning

Each deployment of the safedata system will have to make a binding choice of whether or not to organise datasets into project. The data manager for a project will need to make this decision during the initial configuration of a data system.

The extents element

The safedata_validator package tracks the geographic and temporal extents of datasets, which are needed to generate Gemini 2 metadata for a dataset. A project can provide both soft extents, which cause validation to raise a warning, and hard extents, which case validation to fail.

By default, only hard extents on geographic coordinates are applied:

latitudinal_hard_extent: (-90, 90)
longitudinal_hard_extent: (-180, 180)

Publication configuration

The safedata_zenodo command line tool provides functionality to for publish validated datasets to the Zenodo data repository. You will need to:

Create an account with Zenodo.
Use this account to create a new Zenodo community that will be used to group all published datasets.
From your user account, generate an access token that will allow safedata_zenodo to authenticate access to the Zenodo API for uploading datasets. The token will need to have the deposit:actions scope.

It is extremely highly recommended that you repeat the sign up steps above using the sandbox version of Zenodo, using the same community name. The sandbox site:

provides an identical environment to the real Zenodo site, so that upload function outputs can be checked, without adding unnecessary (or invalid) datasets to the official repository.
allows you to test the safedata_validator workflow all the way through to dataset publication without generating live DOIs.

The configuration element use_sandbox can be used to switch between testing and actual publication. For more information on the sandbox site, see this page.

Important

The access tokens and credentials generated above provide administrator access to your Zenodo community and datasets and you should store them securely. As they are included in the safedata_validator configuration file, you must be careful about who has access to a computer setup to provide validation.

Once you have been through this process, you can then fill out the following configuration elements:

The community_name element: This sets the Zenodo community to be used for publishing datasets on both the main and sandbox Zenodo sites.
The contact_name, contact_affiliation, and contact_orcid elements: All datasets will be published using these contact details and should provide a permanent set of contact information for the project datasets. Note that this is different from the dataset authors.
The zenodo_token and zenodo_sandbox_token elements: These are the personal access tokens generated for the user accounts on the main and sandbox Zenodo sites.
The use_sandbox element: When this element is true, all datasets will be published to the testing sandbox site. Set this to false when you are ready to actually start publishing datasets.

XML configuration

The safedata_zenodo generate_xml tool can be used to generate a geo-spatial XML metadata file for a dataset. This is relatively high-level metadata that just includes the temporal and spatial bounds of the data, along with some contact and access details. We recommend that this file is included when datasets are published. If you want to do this, you will need to update this section with the details for your own project.

The generated XML uses a template that is filled in with using project wide and dataset specific elements. We have tested this template using the INSPIRE validator tool using the "Common Requirements for ISO/TC 19139:2007" and "Conformance Class 1: 'Baseline metadata for data sets and data set series" test suites. This tool may be of use for validating your own XML configuration but does include some elements that are specific to the EU INSPIRE implementation of the more general ISO/TC 19139:2007 metadata specification.

The languageCode, characterSet and epsgCode elements: It is unlikely that you will need to change these, but they just identify the language used in the dataset, the character encoding of the metadata and the EPSG code of the geographic coordinate system used in the data. The default value of 4326 is the code for the widely used WGS84 datum.
The contactCountry and contactEmail elements: The XML includes a number of contact details, including the authors, but also requires a general point of contact. Some of these details (name and OrcID) are re-used from the Zenodo point of contact information above, but the XML validation requires a country and email, so these need to be provided here.
The projectURL element: This is optional - if you want to include a link in the XML to a project site to give context for the dataset, then include it here.
The topicCategories element: This is a troublesome element - it is just a list of topic categories, but different implementations of this XML standard have different list of acceptable values. If highly compliant XML is important to your project, you may need to identify the precise set of topics that this will be validated against.
The lineageStatement: The XML specification requires a lineage statement for the dataset. This could be a highly dataset specific record of the lineage of the data, but this entry is used to provide a generic statement intended to cover all of the dataset collected within a project.

Metadata configuration

Zenodo only allows a fairly limited amount of metadata to be stored for each dataset. While this is completely adequate to describe the contents of a dataset, more extensive metadata must be stored elsewhere if detailed searches within datasets are desired.

The safedata_server web application allows more detailed metadata to be made available to end users of the data and provides an API to aid data discovery and downloading. This API is used extensively by the safedata R package.

To use this system, you will need to deploy the web application to a publically accessible URL and then configure the following elements:

The api and token elements: These provide the URL of the metadata server API and an access token required to authenticate data upload to that server.
The ssl_verify element: In production use, the metadata server should be set up with a properly validated SSL certificate to allow HTTPS, and this is relatively easy using LetsEncrypt. However, when setting up and testing a system, requiring a valid certificate can be a road block and this element allows SSL verification to be turned off.

Configuring the safedata_validator package