
Publishing a dataset

This page provides examples of using the safedata_validator package to publish validated datasets. The examples assume that a user has provided a SAFE formatted dataset and a ZIP file containing additional files:

  • Example.xlsx
  • Supplementary_files.zip

Both of the use cases below include the creation and inclusion of a GEMINI compliant XML metadata file in the published dataset. We recommend this as good practice, but it is optional.

Validating and publishing as a new dataset

The safedata_zenodo publish_dataset command is the main command for publishing a dataset. The dataset metadata file must be provided first and the dataset file second; otherwise the publication process will fail. The example below shows it being used to publish a dataset and an additional external file:

safedata_zenodo publish_dataset Example.json Example.xlsx \
    --external-file Supplementary_files.zip

The expected output from that command is shown below:

- Configuring Resources
    - Configuring resources from user config: configs/config.cfg
    - Validating gazetteer: spatial_resources/gazetteer.geojson
    - Validating location aliases: spatial_resources/location_aliases.csv
    - Validating GBIF database: gbif_databases/gbif_backbone_2021-11-26.sqlite
    - Validating project database: project_databases/safe_projects.csv
Deposit created: 1143714
XML created: 1143714_GEMINI.xml
Uploading files:
Uploading Example.xlsx
100%|███████████████████████████████████████| 160k/160k [00:00<00:00, 494kB/s]
Uploading Supplementary_files.zip
100%|████████████████████████████████████| 1.00k/1.00k [00:00<00:00, 3.47kB/s]
Uploading 1143714_GEMINI.xml
100%|████████████████████████████████████| 27.1k/27.1k [00:00<00:00, 83.4kB/s]
Uploading deposit metadata
Dataset published: https://zenodo.org/records/1143714
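
Because the argument order matters, a script that wraps this command might check the order before running it. The sketch below is illustrative only: build_publish_command is a hypothetical helper, not part of safedata_validator, and it assumes that the metadata file can be recognised by a .json suffix, as in the example above.

```python
from pathlib import Path


def build_publish_command(metadata, dataset, external_file=None):
    """Build the argument list for `safedata_zenodo publish_dataset`.

    Hypothetical helper: it only checks that the JSON metadata file is
    given first and the dataset file second, then assembles the command.
    """
    metadata, dataset = Path(metadata), Path(dataset)

    # Assumption: the validated metadata file has a .json suffix
    if metadata.suffix.lower() != ".json":
        raise ValueError(f"Expected the JSON metadata file first, got: {metadata}")
    if dataset.suffix.lower() == ".json":
        raise ValueError(f"Expected the dataset file second, got: {dataset}")

    command = ["safedata_zenodo", "publish_dataset", str(metadata), str(dataset)]
    if external_file is not None:
        command += ["--external-file", str(external_file)]
    return command


# Show the command that would be run for the example files
print(build_publish_command("Example.json", "Example.xlsx", "Supplementary_files.zip"))
```

The resulting list could be passed to subprocess.run to execute the command.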

The publish_dataset subcommand packages up the set of operations needed to publish a dataset. The example code in the tabs below shows the underlying workflow for publishing these data and accompanying metadata as a completely new dataset using safedata_validator, either from the command line interface or from within Python. You would not typically need to use these individual commands: this information is here to show what is going on under the hood.

#!/bin/sh

# The code below assumes that:
# * the safedata_validator tools find a configuration file in one of the 
#   standard locations. If not, the path can be specified explicitly 
#   using the -r flag:
#     -r /path/to/safedata_validator_local_test.cfg 
# * the Example.xlsx file has been successfully validated, generating the
#   Example.json metadata file

# Publish the dataset to Zenodo
# 1) Create a new deposit, which will generate a deposit metadata file called
#    something like zenodo_1143714.json
safedata_zenodo create_deposit

# 2) Generate a GEMINI XML metadata file for the deposit
safedata_zenodo generate_xml zenodo_1143714.json Example.json 1143714_GEMINI.xml

# 3) Upload the dataset file, external files named in the dataset summary and the XML
#    metadata. This uses the zenodo metadata file to confirm the upload destination.
safedata_zenodo upload_files zenodo_1143714.json \
    Example.xlsx Supplementary_files.zip 1143714_GEMINI.xml

# 4) Update the Zenodo deposit webpage - this populates the deposit description
#    on Zenodo from the dataset metadata
safedata_zenodo upload_metadata zenodo_1143714.json Example.json

# 5) Finally, publish the deposit to create the final record and DOI
safedata_zenodo publish_deposit zenodo_1143714.json

"""Python script to publish a dataset using safedata_validator.

Note that this is essentially just a slimmed down version of the publish_dataset
function underlying the `safedata_zenodo publish_dataset` command line tool.
"""

from pathlib import Path

import simplejson

from safedata_validator.resources import Resources
from safedata_validator.zenodo import (
    ZenodoResources,
    create_deposit,
    generate_inspire_xml,
    publish_deposit,
    upload_files,
    upload_metadata,
)

# Local paths to the dataset file
dataset = "Example.xlsx"
metadata_path = "Example.json"
extra_file = "Supplementary_files.zip"
xml_file = "Example_GEMINI.xml"

# Create a Resources object from a configuration file in a standard location and convert
# to the Zenodo specific resource class
resources = Resources()
zenodo_resources = ZenodoResources(resources)

# Extract the validated dataset metadata
with open(metadata_path) as md_json:
    data_metadata = simplejson.load(md_json)

# Create the new deposit to publish the dataset
create_response = create_deposit(zen_res=zenodo_resources)

# Bail if unsuccessful
if not create_response.ok:
    raise RuntimeError(create_response.error_message)

# Extract the Zenodo metadata from the response
zenodo_metadata = create_response.json_data

# Generate XML
xml_content = generate_inspire_xml(
    dataset_metadata=data_metadata, zenodo_metadata=zenodo_metadata, resources=resources
)
with open(xml_file, "w") as xml_out:
    xml_out.write(xml_content)

# Post the files
files = [Path(f) for f in (dataset, extra_file, xml_file)]
file_upload_response = upload_files(
    zenodo=zenodo_metadata, filepaths=files, zen_res=zenodo_resources
)

if not file_upload_response.ok:
    raise RuntimeError(file_upload_response.error_message)

# Post the metadata
md_upload_response = upload_metadata(
    metadata=data_metadata, zenodo=zenodo_metadata, zen_res=zenodo_resources
)

if not md_upload_response.ok:
    raise RuntimeError(md_upload_response.error_message)

# Publish the deposit
publish_response = publish_deposit(zenodo=zenodo_metadata, zen_res=zenodo_resources)

if not publish_response.ok:
    raise RuntimeError(publish_response.error_message)

# Show the new publication
print(publish_response.json_data["links"]["html"])
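
The script above repeats the same check after every call. As a possible tidy-up, that pattern could be wrapped in a small helper. This is a sketch only: require_ok is a hypothetical function, and StubResponse is a stand-in used for demonstration that assumes nothing more than the ok, error_message and json_data attributes used in the script above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StubResponse:
    """Stand-in exposing only the attributes read in the script above."""

    ok: bool
    json_data: Any = None
    error_message: Optional[str] = None


def require_ok(response, step):
    """Return the response JSON data, or raise naming the step that failed."""
    if not response.ok:
        raise RuntimeError(f"{step} failed: {response.error_message}")
    return response.json_data


# Demonstration with stub responses rather than real Zenodo calls
print(require_ok(StubResponse(ok=True, json_data={"id": 1143714}), "create_deposit"))
```

With a helper like this, each step in the script could become a single line, for example zenodo_metadata = require_ok(create_response, "create_deposit").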

Validating and publishing a new version of a dataset

Zenodo can hold multiple versions of a dataset, allowing you to publish updates and corrections. Each version of a dataset has a different record ID, and all versions are grouped together under a shared concept record ID. One of the versions is always identified as the latest version, and the concept ID works as a DOI that always redirects to that latest version.

When you create a new version of a dataset, the Zenodo system creates an exact copy of the most recent version. Users can then update any files that need changing, remove outdated files, and update the metadata for the new deposit before publishing it.

To do this, provide the record ID of the most recent version of the dataset that you want to update. The most straightforward approach is to use the publish_dataset subcommand and add the --new-version (or -n) argument.

safedata_zenodo publish_dataset Example.json Example.xlsx \
    --external-file Supplementary_files.zip \
    --new-version 1143714

The output from that command would look like:

- Configuring Resources
    - Configuring resources from user config: configs/config.cfg
    - Validating gazetteer: spatial_resources/gazetteer.geojson
    - Validating location aliases: spatial_resources/location_aliases.csv
    - Validating GBIF database: gbif_databases/gbif_backbone_2021-11-26.sqlite
    - Validating project database: project_databases/safe_projects.csv
Preparing new version of deposit 1143714
 - Unchanged files: Supplementary_files.zip
 - Removing outdated files: 1143714_GEMINI.xml, Example.xlsx
 - Uploading new or updated files: Example.xlsx
Deposit created: 1143900
XML created: 1143900_GEMINI.xml
Removing outdated files: 1143714_GEMINI.xml, Example.xlsx
Uploading files:
Uploading Example.xlsx
100%|███████████████████████████████████████| 160k/160k [00:00<00:00, 479kB/s]
Uploading 1143900_GEMINI.xml
100%|████████████████████████████████████| 27.1k/27.1k [00:00<00:00, 89.0kB/s]
Uploading deposit metadata
Dataset published: https://sandbox.zenodo.org/records/1143900

The publish_dataset subcommand does more complex checking when creating a new version of an existing dataset. Because the newly created deposit already contains copies of the most recent files, the command needs to check for:

  • completely new files to be uploaded,
  • existing files where the content has changed and which should be updated,
  • existing files that are not in the new publication request and which should be deleted, and
  • existing files that have not changed and can be left as is.
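
The four cases above can be sketched as a comparison of two dictionaries that map file names to checksums. This is a simplified illustration, not the package's implementation: classify_files and its inputs are hypothetical, and the checksum strings are placeholders.

```python
def classify_files(local, remote):
    """Sort files into the four cases, given {filename: checksum} mappings.

    `local` describes the files provided for the new version and `remote`
    the files copied into the new deposit from the previous version.
    """
    local_names, remote_names = set(local), set(remote)
    shared = local_names & remote_names
    return {
        "new": sorted(local_names - remote_names),
        "changed": sorted(n for n in shared if local[n] != remote[n]),
        "removed": sorted(remote_names - local_names),
        "unchanged": sorted(n for n in shared if local[n] == remote[n]),
    }


# Placeholder checksums, using the file names from the example above
remote = {
    "Example.xlsx": "aaa",
    "Supplementary_files.zip": "bbb",
    "1143714_GEMINI.xml": "ccc",
}
local = {
    "Example.xlsx": "ddd",
    "Supplementary_files.zip": "bbb",
    "1143900_GEMINI.xml": "eee",
}

for status, names in classify_files(local, remote).items():
    print(f"{status}: {', '.join(names)}")
```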

The subcommand will fail under a few circumstances:

  • The provided record ID is not the most recent version. The command automatically checks for the most recent version of the provided ID and will stop if the two do not match. It will print out the most recent ID, but it does not automatically assume that this is what you meant!

  • The files that you have provided to publish are identical to the existing files on the most recent version. The command stops to avoid creating duplicate deposits. This step checks the name and the MD5 hash of each file against the existing file. The hash provides a signature for the contents of a file that allows the code to test for identical files. Using a different name for a file is currently accepted as a change, but we don't advise publishing identical files under different names!

This check ignores any GEMINI XML file in the most recent version. Since these files are named using the record ID of the deposit and are generated from the data files, they will only differ in their file name.
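
As a rough illustration of this kind of check, the sketch below computes MD5 hashes with the Python standard library and compares names and hashes while skipping GEMINI XML files. It is not the package's own code: both functions are hypothetical, the remote checksums are assumed to be available as a simple name to hash mapping, and the _GEMINI.xml suffix is taken from the example file names on this page.

```python
import hashlib
from pathlib import Path

# Assumption: GEMINI XML files can be recognised by this file name suffix
XML_SUFFIX = "_GEMINI.xml"


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate_submission(local_paths, remote_checksums):
    """Test whether local files match the remote names and hashes exactly.

    GEMINI XML files are left out of the comparison on both sides.
    """
    local = {
        Path(p).name: md5_of_file(p)
        for p in local_paths
        if not Path(p).name.endswith(XML_SUFFIX)
    }
    remote = {
        name: checksum
        for name, checksum in remote_checksums.items()
        if not name.endswith(XML_SUFFIX)
    }
    return local == remote
```

A renamed file with unchanged contents is not reported as a duplicate here, because the comparison uses both the name and the hash.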

As above, the tabs below show what is going on within that process. This is more involved than creating a new dataset because the existing files need to be deleted.

#!/bin/sh

# The code below assumes that:
# * the safedata_validator tools find a configuration file in one of the 
#   standard locations. If not, the path can be specified explicitly 
#   using the -r flag:
#     -r /path/to/safedata_validator_local_test.cfg 
# * the Example.xlsx file has been successfully validated, generating the
#   Example.json metadata file


# Publish the dataset to Zenodo
# 1) Create a new deposit as a new version of the most recent version of an existing
#    record. Again this will generate a deposit metadata file called something like
#    zenodo_1156212.json
safedata_zenodo create_deposit --new-version 1143714

# 2) Generate a GEMINI XML metadata file for the deposit
safedata_zenodo generate_xml zenodo_1156212.json Example.json 1156212_GEMINI.xml

# 3) Delete the existing files on the deposit. This uses the zenodo metadata file to
#    confirm the upload destination. 
#
#    Note that in this example, we do not check to see if any of the files are
#    identical. The command below simply deletes all of the files and then reuploads the
#    provided versions. The publish_dataset subcommand handles this in a much more
#    sophisticated way.
safedata_zenodo delete_files zenodo_1156212.json \
    Example.xlsx Supplementary_files.zip 1143714_GEMINI.xml

# 4) Upload the dataset file, external files named in the dataset summary and the new
#    XML metadata. This uses the zenodo metadata file to confirm the upload destination.
safedata_zenodo upload_files zenodo_1156212.json \
    Example.xlsx Supplementary_files.zip 1156212_GEMINI.xml

# 5) Update the Zenodo deposit webpage - this populates the deposit description
#    on Zenodo from the dataset metadata
safedata_zenodo upload_metadata zenodo_1156212.json Example.json

# 6) Finally, publish the deposit to create the final record and DOI
safedata_zenodo publish_deposit zenodo_1156212.json

"""Python script to publish a new version of a dataset using safedata_validator."""

from pathlib import Path

import simplejson

from safedata_validator.resources import Resources
from safedata_validator.zenodo import (
    ZenodoResources,
    create_deposit,
    delete_files,
    generate_inspire_xml,
    publish_deposit,
    upload_files,
    upload_metadata,
)

# Local paths to the files to be published
dataset = "Example.xlsx"
metadata_path = "Example.json"
extra_file = "Supplementary_files.zip"
xml_file = "Example_GEMINI.xml"

# Create a Resources object from a configuration file in a standard location and then
# get the Zenodo specific resources from that
resources = Resources()
zenodo_resources = ZenodoResources(resources=resources)

# Extract the validated dataset metadata
with open(metadata_path) as md_json:
    data_metadata = simplejson.load(md_json)

# Create a new version of an existing dataset using the record ID of the most recent
# version
create_response = create_deposit(new_version=1143714, zen_res=zenodo_resources)

# Bail if unsuccessful
if not create_response.ok:
    raise RuntimeError(create_response.error_message)

# Extract the Zenodo metadata from the response
zenodo_metadata = create_response.json_data

# Generate XML
xml_content = generate_inspire_xml(
    dataset_metadata=data_metadata, zenodo_metadata=zenodo_metadata, resources=resources
)
with open(xml_file, "w") as xml_out:
    xml_out.write(xml_content)

# Get the names of the existing files from the JSON metadata of the deposit and delete
# them all before uploading the provided versions. Note that the publish_dataset
# function does this in a much more sophisticated way.
existing_online_files = [p["key"] for p in zenodo_metadata["files"]]
file_delete_response = delete_files(
    metadata=zenodo_metadata, filenames=existing_online_files, zen_res=zenodo_resources
)

# Bail if unsuccessful
if not file_delete_response.ok:
    raise RuntimeError(file_delete_response.error_message)

# Post the files
files = [Path(f) for f in (dataset, extra_file, xml_file)]
file_upload_response = upload_files(
    zenodo=zenodo_metadata, filepaths=files, zen_res=zenodo_resources
)

# Bail if unsuccessful
if not file_upload_response.ok:
    raise RuntimeError(file_upload_response.error_message)

# Post the metadata
md_upload_response = upload_metadata(
    metadata=data_metadata, zenodo=zenodo_metadata, zen_res=zenodo_resources
)

if not md_upload_response.ok:
    raise RuntimeError(md_upload_response.error_message)

# Publish the deposit
publish_response = publish_deposit(zenodo=zenodo_metadata, zen_res=zenodo_resources)

if not publish_response.ok:
    raise RuntimeError(publish_response.error_message)

# Show the new publication
print(publish_response.json_data["links"]["html"])

Amending metadata of published deposits

Once a deposit has been published to Zenodo its metadata cannot be changed using tools provided by the safedata_validator package.

You can edit the metadata through the Zenodo web interface, but the Zenodo metadata will then no longer match the metadata included in the data files. If you need to change the metadata associated with a published deposit, we strongly recommend uploading a new dataset version with updated metadata.

One reasonable exception is simply updating the access status of a published record, for example to release a dataset from embargo ahead of the scheduled date or to remove restrictions from a dataset.