Managing datasets with safedata_validator
The safedata_validator package is one component of the wider safedata system for
data management and discovery. This system comprises:
- The safedata_validator Python package itself, which validates submitted datasets and ensures that the data and metadata for those datasets are consistent and meet the minimum requirements. When a dataset is successfully validated, the package also provides tools both to publish the dataset to the Zenodo community for your datasets and to upload the metadata to a separate metadata server for the project.
- The metadata server: a web server running the safedata_server web application. This provides an index of the published datasets along with a range of APIs to search the metadata, including text, taxonomy and spatial searches of the published datasets.
- The project Zenodo community: a project-specific grouping of Zenodo records that provides DOIs and download access for the actual data files. Each project using the safedata system will have its own separate Zenodo community.
- The safedata R package, which makes it easy for users to discover and download datasets of interest from your community.
Installing and using safedata_validator
Installing and configuring the safedata_validator package has multiple steps. The
basic overview is:
- Ensure that you have a recent version of Python installed on your computer and install the safedata_validator package from PyPI using the pip package installer tool.
- Create a safedata_validator configuration file, which is used by the package and command line tools to locate required resources and settings.
- Use the safedata_build_local_gbif command line tool to create a local taxonomic validation database and add the location of this file to the configuration.
- Create a GeoJSON gazetteer for your project, defining named locations to be used across datasets, and add the location of the gazetteer file to the configuration.
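As an illustration of the gazetteer step, the sketch below builds a minimal GeoJSON FeatureCollection of named point locations using only the Python standard library. The site names and coordinates are hypothetical, and the use of a `location` property to hold each name is an assumption made for this example; check the gazetteer documentation for the properties your project actually requires.

```python
import json


def make_location_feature(name, lon, lat):
    """Build one GeoJSON Point feature for a named location."""
    return {
        "type": "Feature",
        # Assumption: the location name is stored in a "location" property.
        "properties": {"location": name},
        # GeoJSON coordinate order is [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
    }


def make_gazetteer(locations):
    """Wrap a {name: (lon, lat)} mapping in a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [
            make_location_feature(name, lon, lat)
            for name, (lon, lat) in locations.items()
        ],
    }


# Hypothetical site names and coordinates, for illustration only.
example_locations = {"Plot_A": (117.60, 4.70), "Plot_B": (117.65, 4.72)}

gazetteer = make_gazetteer(example_locations)
gazetteer_text = json.dumps(gazetteer, indent=2)
```

The resulting `gazetteer_text` string can be written to a `.geojson` file, and the path to that file is what would then be recorded in the configuration.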
At this point, you should be able to use the safedata_validate tool to validate
datasets. However, there are extra steps to allow datasets to be
published.
- Create a Zenodo account, community and access token that will be used to publish validated datasets, and add these details to your configuration.
- Set up a safedata_server metadata server to provide a searchable API for the detailed dataset metadata, and again add the details to your configuration.
Once you have installed and configured these tools, you can use the provided command line tools to validate and publish datasets. The usage recipes show how the tools are used.
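If you want to drive validation from a script rather than typing the command by hand, something like the following sketch may help. It assumes that safedata_validate accepts the dataset file as a single positional argument; the helper function names and the example filename are hypothetical, and any further options should be taken from the tool's own help output and the usage recipes rather than from this example.

```python
import shutil
import subprocess


def build_validate_command(dataset_path):
    """Return the argument list for validating one dataset file.

    Assumption: the dataset path is the only required argument.
    """
    return ["safedata_validate", str(dataset_path)]


def validate_dataset(dataset_path):
    """Run the validator if it is installed.

    Returns the process exit code, or None when the safedata_validate
    command cannot be found on the PATH.
    """
    if shutil.which("safedata_validate") is None:
        return None
    result = subprocess.run(build_validate_command(dataset_path), check=False)
    return result.returncode


# Hypothetical filename, for illustration only.
command = build_validate_command("my_dataset.xlsx")
```

Keeping the command construction separate from the call to `subprocess.run` makes it easy to inspect or log exactly what will be run before running it.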
Example safedata format datasets
To help both data managers and data providers understand the safedata format, we
provide a number of resources. Firstly, a template dataset containing the required
worksheets, labels and headers. Secondly, an example dataset demonstrating how to
correctly format a wide variety of different types of data. You can also look at
existing published datasets, such as those from the SAFE Project, to see how the
format is used: