Sequenced taxa worksheets

Like the GBIFTaxa worksheet, a sequenced taxa worksheet is used to record taxonomic details of organisms referred to in the Data worksheets, but it is intended for taxa identified through sequencing rather than field observation. As with the GBIFTaxa worksheet, the taxonomic information is used to:

  • Cross-check the listed taxa against taxon fields in the data worksheets to confirm that there is a complete list of the taxa recorded in the datasets.
  • Generate a hierarchical taxon index that can be used to search datasets for taxa of interest.

However, unlike for the GBIFTaxa worksheet, safedata_validator does not validate the provided taxa against the underlying bioinformatic datasets. Latin binomial names from field observations can be matched against any broad taxonomic database (we use GBIF because it is well curated and updated), but taxonomic ids from sequencing can only really be validated against the specific bioinformatic database and version used by the researchers. This is beyond the scope of safedata_validator, so no validation is performed. We instead expect researchers to provide simple information in the Summary data about the bioinformatic dataset used to generate the sequenced taxon ids, which could be used to revisit taxon identification.

A sequenced taxa worksheet is used to provide a table of taxonomic ranks for matched sequences. This is a common export format from bioinformatics workflows. It allows safedata_validator to build a hierarchical taxon index for these taxa and to provide a list of sequenced taxon names to be checked against the taxon names used in the data worksheets.
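As a rough illustration of the name cross-check, the sketch below compares the local names declared in a taxa worksheet with the names used in a data worksheet. It is not the safedata_validator implementation: the function name `cross_check_taxon_names`, the report keys and the example names are all assumptions made for illustration.

```python
def cross_check_taxon_names(taxa_names, data_names):
    """Compare names declared in taxa worksheets with names used in data.

    Hypothetical helper for illustration, not part of safedata_validator.
    """
    declared = set(taxa_names)
    used = set(data_names)
    return {
        # Names used in the data worksheets but missing from the taxa worksheets
        "undeclared": sorted(used - declared),
        # Names listed in the taxa worksheets but never used in the data
        "unused": sorted(declared - used),
    }


# Invented example inputs: ASV_999 is used but not declared,
# ASV_101 is declared but never used.
report = cross_check_taxon_names(
    taxa_names=["ASV_100", "ASV_101", "ASV_102"],
    data_names=["ASV_100", "ASV_102", "ASV_999"],
)
print(report)
```

Using sets keeps the comparison independent of row order and of how often a name is repeated in the data.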

Multiple sequenced taxa worksheets can be provided. The names of these sheets and the details (name, version, etc.) of the reference databases used to generate them should be provided in the Summary metadata. If taxa are used anywhere in the dataset, then at least one sequenced taxa worksheet or a GBIFTaxa worksheet must be included. You can also provide both a GBIFTaxa worksheet and one or more sequenced taxa worksheets when reporting taxa identified through both sequencing and field observation.

We would also encourage you, where possible, to include the raw sequencing data used to generate your taxonomies: having the raw genomic information alongside the taxonomic assignments presented in the sequenced taxa worksheets improves replicability. Sequencing data should be included as external files, and you should note in the comments which sequenced taxa worksheet each file provides raw data for.

Taxon table layout

The table format looks like this:

| Name | Kingdom | Phylum | Class | Order | Family | Genus | Species | Comments |
|---|---|---|---|---|---|---|---|---|
| ASV_100 | k__Fungi | p__Basidiomycota | c__Tremellomycetes | o__Tremellales | f__Trimorphomycetaceae | g__Saitozyma | s__podzolica | |
| ASV_101 | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Hypocreales | f__Hypocreaceae | g__Trichoderma | s__spirale | |
| ASV_102 | k__Fungi | p__Basidiomycota | c__Tremellomycetes | o__Trichosporonales | f__Trichosporonaceae | g__Apiotrichum | s__sporotrichoides | |
| ASV_103 | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Hypocreales | f__Hypocreaceae | g__Trichoderma | NA | |
| ASV_104 | k__Fungi | p__Mortierellomycota | c__Mortierellomycetes | o__Mortierellales | f__Mortierellaceae | g__Mortierella | s__elongata | |
| ASV_105 | k__Fungi | p__Ascomycota | c__Eurotiomycetes | o__Eurotiales | f__Trichocomaceae | g__Talaromyces | NA | |
| ASV_106 | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Hypocreales | f__Ophiocordycipitaceae | g__Purpureocillium | s__lilacinum | |
| ASV_107 | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Hypocreales | f__Hypocreaceae | g__Trichoderma | s__harzianum | |
| ASV_108 | k__Fungi | p__Ascomycota | c__Geoglossomycetes | o__Geoglossales | f__Geoglossaceae | g__Geoglossum | s__difforme | |
| ASV_109 | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Hypocreales | f__Hypocreaceae | g__Trichoderma | s__harzianum | Example comment |
| ASV_110 | k__Fungi | p__Ascomycota | c__Dothideomycetes | o__Pleosporales | f__Cucurbitariaceae | g__Pyrenochaetopsis | s__leptospora | |
| ASV_111 | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Hypocreales | f__Hypocreaceae | g__Trichoderma | s__koningiopsis | |
| ASV_113 | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Hypocreales | f__Clavicipitaceae | g__Metarhizium | s__carneum | |
| ASV_114 | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Chaetosphaeriales | f__Chaetosphaeriaceae | g__Chloridium | NA | |

The table must contain column headers in the first row of the worksheet. The Name column is mandatory and must contain a local name for every taxon that you are going to use in the rest of the dataset, aside from those that are described in another sequenced taxa worksheet or in a GBIFTaxa worksheet. The other columns help us maintain a taxonomic index for the taxa used in your datasets.

  • Name: This column is mandatory and contains the names that you are going to use in your data worksheets. These can be abbreviations or codes: if you want to use E_coli_FaMsAh in your data worksheets, rather than typing out Escherichia coli FaMsAh gene for 16S rRNA, LC842012 every time, then that is fine.

Please note that different worksheet names need to be used if a taxon is detected in multiple ways and hence represented in multiple taxa worksheets.

  • Ranks: Here the column name (e.g. Phylum) provides a taxonomic rank, and the row entries provide the relevant names for this rank. A top-level rank must be provided: one of domain, superkingdom or kingdom. If either domain or superkingdom is provided, kingdom can also be provided. The other ranks that can be provided are the standard backbone ranks (phylum down to species). It is up to you which of these you provide, but if a rank is provided then every higher backbone rank must also be provided, e.g. if order is provided as a rank, then class and phylum must also be provided. Non-backbone ranks can be provided (e.g. subspecies or strain), but they are treated as additional information and are therefore not validated.

    Note

    Missing rank entries are generally completely fine, e.g. if you have genera that haven't been assigned to a family, no entry has to be provided for the family rank. However, an entry for the highest taxonomic rank has to be provided for every row.

    Names can be provided in plain text, or alternatively in a commonly used notation, where the rank is indicated by a lower case first letter and the name follows after two underscores (e.g. k__Bacteria for Kingdom Bacteria). Notation of this type should be placed in the correct rank columns, and validation is carried out to check that the rank implied by the notation matches the column rank.

    Entries for the species rank should not be provided as binomials! Instead, the species name and the genus name should be provided separately, and the validator will construct the binomial automatically for the searchable metadata. Standard taxonomic tags (e.g. "candidatus") are fine to include as part of the name; however, they are removed from the searchable metadata.

  • Comments and other fields: These fields are obviously optional. If you do have particular notes that you want to make - new species notes and the like - then these can be very useful for future researchers trying to place taxa. Equally if you want to record further information about the taxon rows, you can add additional fields as long as they do not duplicate any of the field names mentioned above.
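The rank rules described in the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the safedata_validator code: the function names (`check_rank_columns`, `parse_rank_entry`, `build_binomial`) are hypothetical, the notation prefix is assumed to be the first letter of the column's rank name, and treating `NA` as a missing entry is an assumption based on the example table.

```python
import re

TOP_RANKS = ("domain", "superkingdom", "kingdom")
BACKBONE = ("phylum", "class", "order", "family", "genus", "species")
# Assumed notation: one lower-case letter, two underscores, then the name.
NOTATION = re.compile(r"^([a-z])__(.+)$")


def check_rank_columns(headers):
    """Return a list of problems with the rank columns in a header row."""
    ranks = [h.strip().lower() for h in headers]
    problems = []
    if not any(r in TOP_RANKS for r in ranks):
        problems.append("no top-level rank (domain, superkingdom or kingdom)")
    present = [r in ranks for r in BACKBONE]
    if any(present):
        # Every backbone rank above the lowest one provided must be present.
        lowest = max(i for i, p in enumerate(present) if p)
        for rank in BACKBONE[:lowest]:
            if rank not in ranks:
                problems.append(f"missing backbone rank: {rank}")
    return problems


def parse_rank_entry(column, entry):
    """Strip k__ style notation, checking the prefix against the column."""
    # Assumption: empty cells and "NA" both mean "no entry for this rank".
    if entry is None or entry.strip() in ("", "NA"):
        return None
    text = entry.strip()
    match = NOTATION.match(text)
    if match is None:
        return text  # plain text name, used as given
    prefix, name = match.groups()
    if prefix != column.strip().lower()[0]:
        raise ValueError(f"{entry!r} does not match column rank {column!r}")
    return name


def build_binomial(genus, species):
    """Combine separate genus and species entries into a binomial."""
    if genus is None or species is None:
        return None
    return f"{genus} {species}"


# Example row taken from the table above (ASV_101), subset of columns.
row = {"Kingdom": "k__Fungi", "Genus": "g__Trichoderma", "Species": "s__spirale"}
names = {col: parse_rank_entry(col, val) for col, val in row.items()}
binomial = build_binomial(names["Genus"], names["Species"])
print(binomial)
```

Keeping the header check separate from the per-entry parsing mirrors the text: the set of rank columns is a property of the worksheet, while notation and missing values are handled row by row.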

My data is not sequencing data

You should record this data using GBIF format on a GBIFTaxa worksheet instead.

My data doesn't contain taxa

Fine. You can omit both the GBIFTaxa worksheet and the sequenced taxa worksheets!