The summary submodule

This module handles the parsing and validation of the summary table for a Dataset. The table consists of rows of information, with row labels in the first column. The module defines a single class, Summary, which provides methods for loading the summary data from file.
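
As an illustration (the values here are invented), the first few rows of a summary table might look like this; row labels are matched case-insensitively:

Title           My example dataset
Description     A short description of the dataset contents
Access status   Open
Author name     Surname, First name
Keywords        ecology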

Summary

Interface for dataset summary metadata.

This class provides an interface to the summary metadata for a dataset. The loading methods check the information provided in the summary worksheet and populate the attributes of the class instance to pass that information to other components.

The methods are intended to try to get as much information as possible from the Summary table: the instance attributes may therefore be set to None for missing metadata, so classes using Summary should handle None values.

Parameters:

    resources (Resources, required): An instance of Resources providing the
        safedata_validator configuration.
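
A minimal usage sketch (not part of the library documentation): it assumes that a safedata_validator configuration is discoverable by Resources(), that Resources is importable from safedata_validator.resources, and that dataset.xlsx contains a Summary worksheet.

import openpyxl

from safedata_validator.resources import Resources
from safedata_validator.summary import Summary

# Load the configuration and build the Summary interface
resources = Resources()
summary = Summary(resources)

# Open the workbook and validate its Summary worksheet
workbook = openpyxl.load_workbook("dataset.xlsx", read_only=True)
summary.load(
    worksheet=workbook["Summary"],
    sheetnames=set(workbook.sheetnames),
    validate_doi=False,  # True checks DOIs online at https://doi.org
)

# Attributes may be None for missing metadata, so check before use
if summary.n_errors == 0 and summary.funders is not None:
    print(summary.title, len(summary.funders))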
Source code in safedata_validator/summary.py
class Summary:
    """Interface for dataset summary metadata.

    This class provides an interface to the summary metadata for a dataset. The loading
    methods check the information provided in the summary worksheet and populate the
    attributes of the class instance to pass that information to other components.

    The methods are intended to try to get as much information as possible from the
    Summary table: the instance attributes may therefore be set to None for missing
    metadata, so classes using Summary should handle None values.

    Args:
        resources: An instance of Resources providing the safedata_validator
            configuration.
    """

    def __init__(self, resources: Resources) -> None:
        self.title: str
        """A string giving the dataset title."""
        self.description: str
        """A string giving a description of the dataset."""
        self.access: dict
        """A dictionary giving access metadata."""
        self.authors: list[dict]
        "A list of dictionaries of author metadata."
        self.permits: list[dict]
        """A list of dictionaries of research permit metadata."""
        self.publication_doi = None
        """A list of DOIs associated with the dataset."""
        self.funders = None
        """A list of dictionaries of funder metadata."""
        self.keywords: list[str]
        """A list of keyword strings."""
        self.temporal_extent: Extent = Extent(
            "temporal extent",
            (datetime.date,),
            hard_bounds=resources.extents.temporal_hard_extent,
            soft_bounds=resources.extents.temporal_soft_extent,
        )
        """Extent instance for the temporal extent of the Dataset."""
        self.latitudinal_extent: Extent = Extent(
            "latitudinal extent",
            (float, int),
            hard_bounds=resources.extents.latitudinal_hard_extent,
            soft_bounds=resources.extents.latitudinal_soft_extent,
        )
        """Extent instance for the latitudinal extent of the Dataset."""
        self.longitudinal_extent: Extent = Extent(
            "longitudinal extent",
            (float, int),
            hard_bounds=resources.extents.longitudinal_hard_extent,
            soft_bounds=resources.extents.longitudinal_soft_extent,
        )
        """Extent instance for the longitudinal extent of the Dataset."""
        self.external_files: list[dict] | None = None
        """A list of dictionaries of external file metadata."""
        self.data_worksheets: list[Worksheet] = []
        """A list of dictionaries of data tables in the Dataset."""

        self._rows: dict = {}
        """A private attribute holding the row data for the summary."""
        self._ncols: int
        """A private attribute holding the total number of columns in the summary."""
        self.n_errors: int = 0
        """The number of validation errors found in the summary."""
        self.projects: dict[int, str] = resources.projects
        """A dictionary of valid project data."""
        self.project_id: list[int] | None = None
        """A list of project ID codes, if project IDs are configured."""

        self.validate_doi = False
        """A boolean flag indicating whether DOI values should be validated."""

        # Define the blocks and fields in the summary - note that the project ids block
        # is mandatory if a project database has been populated.
        self.fields: dict[str, SummaryBlock] = dict(
            core=SummaryBlock(
                fields=[
                    SummaryField("title", True, None, str),
                    SummaryField("description", True, None, str),
                ],
                mandatory=True,
                title="Core fields",
                singular=True,
            ),
            project_ids=SummaryBlock(
                fields=[
                    SummaryField("safe project id", False, None, int),
                    SummaryField("project id", False, None, int),
                ],
                mandatory=True if self.projects else False,
                title="Project IDs",
                singular=False,
            ),
            access=SummaryBlock(
                fields=[
                    SummaryField("access status", True, "access", str),
                    SummaryField(
                        "embargo date", False, "embargo_date", datetime.datetime
                    ),
                    SummaryField("access conditions", False, "access_conditions", str),
                ],
                mandatory=True,
                title="Access details",
                singular=True,
            ),
            keywords=SummaryBlock(
                fields=[SummaryField("keywords", True, None, str)],
                mandatory=True,
                title="Keywords",
                singular=False,
            ),
            doi=SummaryBlock(
                fields=[SummaryField("publication doi", True, None, str)],
                mandatory=False,
                title="DOI",
                singular=False,
            ),
            date=SummaryBlock(
                fields=[
                    SummaryField("start date", True, None, datetime.datetime),
                    SummaryField("end date", True, None, datetime.datetime),
                ],
                mandatory=False,
                title="Date Extents",
                singular=True,
            ),
            geo=SummaryBlock(
                fields=[
                    SummaryField("west", True, None, float),
                    SummaryField("east", True, None, float),
                    SummaryField("south", True, None, float),
                    SummaryField("north", True, None, float),
                ],
                mandatory=False,
                title="Geographic Extents",
                singular=True,
            ),
            authors=SummaryBlock(
                fields=[
                    SummaryField("author name", True, "name", str),
                    SummaryField("author affiliation", False, "affiliation", str),
                    SummaryField("author email", False, "email", str),
                    SummaryField("author orcid", False, "orcid", str),
                ],
                mandatory=True,
                title="Authors",
                singular=False,
            ),
            funding=SummaryBlock(
                fields=[
                    SummaryField("funding body", True, "body", str),
                    SummaryField("funding type", True, "type", str),
                    SummaryField("funding reference", False, "ref", (str, int, float)),
                    SummaryField("funding link", False, "url", str),
                ],
                mandatory=False,
                title="Funding Bodies",
                singular=False,
            ),
            external=SummaryBlock(
                fields=[
                    SummaryField("external file", True, "file", str),
                    SummaryField("external file description", True, "description", str),
                ],
                mandatory=False,
                title="External Files",
                singular=False,
            ),
            worksheet=SummaryBlock(
                fields=[
                    SummaryField("worksheet name", True, "name", str),
                    SummaryField("worksheet title", True, "title", str),
                    SummaryField("worksheet description", True, "description", str),
                    SummaryField("worksheet external file", False, "external", str),
                ],
                mandatory=False,
                title="Worksheets",
                singular=False,
            ),
            permits=SummaryBlock(
                fields=[
                    SummaryField("permit type", True, "type", str),
                    SummaryField("permit authority", True, "authority", str),
                    SummaryField("permit number", True, "number", (str, int, float)),
                ],
                mandatory=False,
                title="Permits",
                singular=False,
            ),
        )
        """A dictionary setting the summary blocks that can be present."""

    @loggerinfo_push_pop("Checking Summary worksheet")
    def load(
        self,
        worksheet: Worksheet,
        sheetnames: set,
        validate_doi: bool = False,
    ) -> None:
        """Populate a Summary instance from an Excel Worksheet.

        Args:
            worksheet: An openpyxl worksheet instance.
            sheetnames: A set of sheet names found in the workbook.
            validate_doi: Should publication DOIs be validated (needs web connection).
        """
        handler = get_handler()
        start_errors = handler.counters["ERROR"]

        self.validate_doi = validate_doi

        rows = load_rows_from_worksheet(worksheet)

        self._ncols = worksheet.max_column

        # convert into dictionary using the lower cased first entry as the key after
        # checking for empty values (None) and non-string values.
        row_headers = IsString([r[0] for r in rows])
        if not row_headers:
            # Check for None separately because seeing 'None' as a field key in the
            # report is very confusing for end users.
            if None in row_headers.failed:
                LOGGER.error("Summary metadata fields column contains empty cells")

            # Get other non-string headers.
            non_none_failed_headers = [
                hdr for hdr in row_headers.failed if hdr is not None
            ]
            if non_none_failed_headers:
                LOGGER.error(
                    "Summary metadata fields column contains non text values: ",
                    extra={"join": non_none_failed_headers},
                )

        # Now we have warned about bad values, explicitly cast row keys to string to
        # continue processing.
        self._rows = {str(rw[0]).lower(): rw[1:] for rw in rows}

        # Check if metadata keys have white space padding
        self._check_for_whitespace()

        # Validate the keys found in the summary table
        self._validate_keys()

        # Now process the field blocks
        self._load_core()
        self._load_project_ids()
        self._load_access_details()
        self._load_authors()
        self._load_keywords()
        self._load_doi()
        self._load_temporal_extent()
        self._load_geographic_extent()
        self._load_funders()
        self._load_permits()
        self._load_external_files()
        self._load_data_worksheets(sheetnames)

        # summary of processing
        self.n_errors = handler.counters["ERROR"] - start_errors
        if self.n_errors > 0:
            LOGGER.info(f"Summary contains {self.n_errors} errors")
        else:
            LOGGER.info("Summary formatted correctly")

    def _check_for_whitespace(self) -> None:
        """Check that the summary keys do not have whitespace padding.

        This function checks that the keys in a summary table do not have white space
        padding; if they do, the padding is removed and an error is logged.
        """

        clean_metadata_keys = IsNotPadded(self._rows.keys())
        if not clean_metadata_keys:
            # Report whitespace padding and clean up tuples
            LOGGER.error(
                "Whitespace padding in summary field names: ",
                extra={"join": clean_metadata_keys.failed},
            )

            # Order preserved in dict and validator
            cleaned_entries = [
                (ky, val) for ky, val in zip(clean_metadata_keys, self._rows.values())
            ]
            self._rows = dict(cleaned_entries)

    def _validate_keys(self) -> None:
        """Validate the summary keys recovered.

        This function checks that the keys in a summary table include the minimum set of
        mandatory fields in mandatory blocks and that all found keys are known.
        """

        # Populate found, required and known field keys
        found: set[str] = set(self._rows.keys())
        required: set[str] = set()
        known: set[str] = set()

        for block in self.fields.values():
            # Required keys
            if block.mandatory:
                required = required.union(
                    fld.key for fld in block.fields if fld.mandatory
                )
            # Known keys
            known = known.union(fld.key for fld in block.fields)

        # Look for and report on issue
        missing = required - found
        unknown = found - known
        if missing:
            LOGGER.error("Missing mandatory metadata fields: ", extra={"join": missing})

        if unknown:
            LOGGER.error("Unknown metadata fields: ", extra={"join": unknown})

    def _read_block(self, block: SummaryBlock) -> list | None:
        """Read a block of fields from a summary table.

        This internal method takes a given block definition from the Summary class
        [fields][safedata_validator.summary.Summary.fields] attribute and returns a list
        of dictionary records for that block. This function automatically does some
        common checking for missing data, bad input types etc, leaving the block
        specific functions to handle unique tests.

        Args:
            block: A SummaryBlock instance describing the block
        """

        mandatory_fields = [fld.key for fld in block.fields if fld.mandatory]
        optional_fields = [fld.key for fld in block.fields if not fld.mandatory]
        field_map = [
            (fld.key, fld.internal) for fld in block.fields if fld.internal is not None
        ]
        field_types = {fld.key: fld.types for fld in block.fields}

        # Get the full list of field names
        all_fields = mandatory_fields + optional_fields

        # Get the data, filling in completely missing rows
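        # (missing fields get self._ncols - 1 None values: one per data column)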
        data = {
            k: self._rows[k] if k in self._rows else [None] * (self._ncols - 1)
            for k in all_fields
        }

        # Empty cells are already None, but also filter values to catch
        # pure whitespace content and replace with None
        for ky, vals in data.items():
            vals = IsNotSpace(vals)
            if not vals:
                LOGGER.error(f"Whitespace only cells in field {ky}")

            data[ky] = vals.values

        # Pivot to dictionary of records
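        # e.g. {'a': (1, 2), 'b': (3, 4)} -> [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]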
        block_list = [dict(zip(data.keys(), vals)) for vals in zip(*data.values())]

        # Drop empty records
        block_list = [bl for bl in block_list if any(bl.values())]

        # Continue if data are present
        if not block_list:
            if block.mandatory:
                LOGGER.error(f"No {block.title} metadata found")
            else:
                LOGGER.info(f"No {block.title} metadata found")
            return None
        else:
            LOGGER.info(f"Metadata for {block.title} found: {len(block_list)} records")

            if len(block_list) > 1 and block.singular:
                LOGGER.error("Only a single record should be present")

            # report on block fields
            for fld in mandatory_fields:
                fld_values = [rec[fld] for rec in block_list]
                if not all(fld_values):
                    LOGGER.error(f"Missing metadata in mandatory field {fld}")

            # report on actual data that is of the wrong type
            for fld in all_fields:
                bad_values = [
                    rec[fld]
                    for rec in block_list
                    if rec[fld] is not None
                    and not isinstance(rec[fld], field_types[fld])
                ]

                if bad_values:
                    LOGGER.error(
                        f"Field {fld} contains values of wrong type: ",
                        extra={"join": bad_values},
                    )

            # remap names if provided
            for old, new in field_map:
                for rec in block_list:
                    rec[new] = rec[old]
                    rec.pop(old)

            return block_list

    @loggerinfo_push_pop("Loading author metadata")
    def _load_authors(self):
        """Load the author block.

        Provides summary validation specific to the author block.
        """
        authors = self._read_block(self.fields["authors"])

        # Author specific validation
        if authors is not None:
            # Badly formatted names
            bad_names = [
                rec["name"]
                for rec in authors
                if isinstance(rec["name"], str) and not RE_NAME.match(rec["name"])
            ]
            if bad_names:
                LOGGER.error(
                    "Author names not formatted as last_name, first_names: ",
                    extra={"join": bad_names},
                )

            # Emails not formatted properly
            bad_emails = [
                rec["email"]
                for rec in authors
                if isinstance(rec["email"], str) and not RE_EMAIL.match(rec["email"])
            ]
            if bad_emails:
                LOGGER.error(
                    "Author emails not properly formatted: ", extra={"join": bad_emails}
                )

            # ORCIDs not properly formatted
            bad_orcid = [
                rec["orcid"]
                for rec in authors
                if isinstance(rec["orcid"], str) and not RE_ORCID.match(rec["orcid"])
            ]
            if bad_orcid:
                LOGGER.error(
                    "Author ORCIDs not properly formatted: ", extra={"join": bad_orcid}
                )

        self.authors = authors

    @loggerinfo_push_pop("Loading keywords metadata")
    def _load_keywords(self):
        """Load the keywords block.

        Provides summary validation specific to the keywords block.
        """
        keywords = self._read_block(self.fields["keywords"])

        # extra data validation for keywords
        if keywords:
            keywords = [rec["keywords"] for rec in keywords]
            keywords = NoPunctuation(keywords)
            if not keywords:
                LOGGER.error(
                    "Put each keyword in a separate cell, do not separate "
                    "keywords using commas or semi-colons"
                )

            self.keywords = keywords.values

    @loggerinfo_push_pop("Loading permit metadata")
    def _load_permits(self):
        """Load the permits block.

        Provides summary validation specific to the permits block - users provide a
        permit authority, number and permit type.
        """

        permits = self._read_block(self.fields["permits"])

        # Permit specific checking for allowed permit types
        if permits:
            permit_types = [
                rec["type"].lower() for rec in permits if isinstance(rec["type"], str)
            ]
            valid_permit_types = {"research", "export", "ethics"}
            unknown_types = set(permit_types) - valid_permit_types
            if unknown_types:
                LOGGER.error(
                    "Unknown permit types: ", extra={"join": unknown_types}
                )

        self.permits = permits

    @loggerinfo_push_pop("Loading DOI metadata")
    def _load_doi(self):
        """Load the DOI block.

        Provides summary validation specific to the DOI block.
        """
        # CHECK FOR PUBLICATION DOIs
        pub_doi = self._read_block(self.fields["doi"])

        # Extra data validation for DOIs
        if pub_doi is not None:
            # Check DOI URLS _are_ urls
            pub_doi_re = [
                RE_DOI.search(v["publication doi"])
                for v in pub_doi
                if isinstance(v["publication doi"], str)
            ]
            if not all(pub_doi_re):
                LOGGER.error("Publication DOIs not all in format: https://doi.org/...")

            if self.validate_doi:
                for is_doi in pub_doi_re:
                    if is_doi:
                        api_call = (
                            f"https://doi.org/api/handles/"
                            f"{is_doi.string[is_doi.end():]}"
                        )
                        r = requests.get(api_call)
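                        # responseCode 1 indicates the handle (DOI) was found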
                        if r.json()["responseCode"] != 1:
                            LOGGER.error(f"DOI not found: {is_doi.string}")

        self.publication_doi = pub_doi

    @loggerinfo_push_pop("Loading funding metadata")
    def _load_funders(self):
        """Load the funders block.

        Provides summary validation specific to the funders block - users provide a
        funding body and funding type and, optionally, a reference number and URL.
        """

        # LOOK FOR FUNDING DETAILS - users provide a funding body and a description
        # of the funding type and then optionally a reference number and a URL

        funders = self._read_block(self.fields["funding"])

        # TODO - currently no check beyond _read_block but maybe actually check
        #        the URL is a URL and maybe even opens? Could use urllib.parse

        self.funders = funders

    @loggerinfo_push_pop("Loading temporal extent metadata")
    def _load_temporal_extent(self):
        """Load the temporal extent block.

        Provides summary validation specific to temporal extents.
        """
        temp_extent = self._read_block(self.fields["date"])

        # temporal extent validation and updating
        if temp_extent is not None:
            start_date = temp_extent[0]["start date"]
            end_date = temp_extent[0]["end date"]

            if not (
                isinstance(start_date, datetime.datetime)
                and isinstance(end_date, datetime.datetime)
            ):
                LOGGER.error("Temporal extents are not date values")
                return

            if not (
                start_date.time() == datetime.time(0, 0)
                and end_date.time() == datetime.time(0, 0)
            ):
                LOGGER.error("Temporal extents should be date not datetime values")
                return

            if start_date > end_date:
                LOGGER.error("Start date is after end date")
                return

            self.temporal_extent.update([start_date.date(), end_date.date()])

    @loggerinfo_push_pop("Loading geographic extent metadata")
    def _load_geographic_extent(self):
        """Load the geographic extents block.

        Provides summary validation specific to geographic extents block.
        """
        geo_extent = self._read_block(self.fields["geo"])

        if geo_extent is not None:
            bbox = geo_extent[0]

            if all([isinstance(v, float) for v in bbox.values()]):
                if bbox["south"] > bbox["north"]:
                    LOGGER.error("South limit is greater than north limit")
                else:
                    self.latitudinal_extent.update([bbox["south"], bbox["north"]])

                if bbox["west"] > bbox["east"]:
                    LOGGER.error("West limit is greater than east limit")
                else:
                    self.longitudinal_extent.update([bbox["west"], bbox["east"]])

    @loggerinfo_push_pop("Loading external file metadata")
    def _load_external_files(self):
        """Load the external files block.

        Provides summary validation specific to the external files block. Small datasets
        will usually be contained entirely in a single Excel file, but where formatting
        or size issues require external files, their names and descriptions are included
        in the summary information.
        """

        external_files = self._read_block(self.fields["external"])

        # external file specific validation - no internal spaces.
        if external_files is not None:
            bad_names = [
                exf["file"]
                for exf in external_files
                if isinstance(exf["file"], str)
                and RE_CONTAINS_WSPACE.search(exf["file"])
            ]
            if any(bad_names):
                LOGGER.error(
                    "External file names must not contain whitespace: ",
                    extra={"join": bad_names},
                )

        self.external_files = external_files

    @loggerinfo_push_pop("Loading data worksheet metadata")
    def _load_data_worksheets(self, sheetnames):
        """Load the worksheets block.

        Provides summary validation specific to the worksheets block.
        """
        data_worksheets = self._read_block(self.fields["worksheet"])

        # Strip out faulty inclusion of Taxa and Location worksheets in
        # data worksheets before considering combinations of WS and external files
        if data_worksheets is not None:
            cited_sheets = [ws["name"] for ws in data_worksheets]

            if ("Locations" in cited_sheets) or ("Taxa" in cited_sheets):
                LOGGER.error(
                    "Do not include Taxa or Locations metadata sheets in "
                    "Data worksheet details"
                )

                data_worksheets = [
                    ws
                    for ws in data_worksheets
                    if ws["name"] not in ("Locations", "Taxa")
                ] or None

        # Look to see what data is available - must be one or both of data worksheets
        # or external files and validate worksheets if present.
        cited_sheets = set()

        if data_worksheets is None and self.external_files is None:
            LOGGER.error("No data worksheets or external files provided - no data.")
            return
        elif data_worksheets is None:
            LOGGER.info("Only external file descriptions provided")
            return

        # Check sheet names in list of sheets
        cited_sheets = {ws["name"] for ws in data_worksheets}

        # Names not in list of sheets
        for each_ws in data_worksheets:
            # Now match to sheet names
            if each_ws["name"] not in sheetnames:
                if each_ws["external"] is not None:
                    LOGGER.info(
                        f"Worksheet summary {each_ws['name']} recognized as placeholder"
                        f" for external file {each_ws['external']}"
                    )
                else:
                    LOGGER.error(f"Data worksheet {each_ws['name']} not found")

        # bad external files
        external_in_sheet = {
            ws["external"] for ws in data_worksheets if ws["external"] is not None
        }
        if self.external_files is not None:
            external_names = {ex["file"] for ex in self.external_files}
        else:
            external_names = set()

        bad_externals = external_in_sheet - external_names
        if bad_externals:
            LOGGER.error(
                "Worksheet descriptions refer to unreported external files: ",
                extra={"join": bad_externals},
            )

        # Check for existing sheets without description
        extra_names = (
            set(sheetnames)
            - {"Summary", "Taxa", "GBIFTaxa", "NCBITaxa", "Locations"}
            - cited_sheets
        )
        if extra_names:
            LOGGER.error(
                "Undocumented sheets found in workbook: ", extra={"join": extra_names}
            )

        self.data_worksheets = data_worksheets

    @loggerinfo_push_pop("Loading access metadata")
    def _load_access_details(self):
        """Load the access block.

        Provides summary validation specific to the access block.
        """

        access = self._read_block(self.fields["access"])

        # Guard against all rows being absent.
        if access is None:
            return

        access = access[0]

        # Access specific validation - bad types handled by _read_block
        # - status must be in list of three accepted values
        if isinstance(access["access"], str):
            status = access["access"].lower()
            embargo_date = access["embargo_date"]

            if status not in ["open", "embargo", "restricted"]:
                LOGGER.error(
                    f"Access status must be Open, Embargo or Restricted not "
                    f"{access['access']}"
                )

            if status == "embargo":
                if embargo_date is None:
                    LOGGER.error("Dataset embargoed but no embargo date provided")
                elif isinstance(embargo_date, datetime.datetime):
                    now = datetime.datetime.now()

                    if embargo_date < now:
                        LOGGER.error("Embargo date is in the past.")
                    elif embargo_date > now + datetime.timedelta(days=2 * 365):
                        LOGGER.error("Embargo date more than two years in the future.")
                    else:
                        LOGGER.info(f"Dataset access: embargoed until {embargo_date}")

                if access["access_conditions"] is not None:
                    LOGGER.error("Access conditions cannot be set on embargoed data.")

            elif status == "restricted":
                access_conditions = access["access_conditions"]

                if embargo_date is not None:
                    LOGGER.error("Do not set an embargo date with restricted datasets")

                if access_conditions is None:
                    LOGGER.error(
                        "Dataset restricted but no access conditions specified"
                    )
                else:
                    LOGGER.info(
                        f"Dataset access: restricted with conditions "
                        f"{access_conditions}"
                    )
            else:
                LOGGER.info(f"Dataset access: {status}")

        self.access = access

    @loggerinfo_push_pop("Loading core metadata")
    def _load_core(self):
        """Load the core block.

        Provides summary validation specific to the core block.
        """

        core = self._read_block(self.fields["core"])

        # Guard against all rows being absent.
        if core is None:
            return

        self.title = core[0]["title"]
        self.description = core[0]["description"]

    @loggerinfo_push_pop("Loading project id metadata")
    def _load_project_ids(self):
        """Load the project ids block.

        Provides summary validation specific to the project ids block.
        """

        project_data = self._read_block(self.fields["project_ids"])

        if not self.projects and project_data is None:
            LOGGER.info("No project id data required or provided.")
            return

        # Bail if no validation should be done, warning if data provided.
        if not self.projects and project_data is not None:
            LOGGER.error("Project ids are not required but are provided.")
            return

        # Bail if validation should be done but no data provided
        if self.projects and project_data is None:
            LOGGER.error("Project ids are required but not provided.")
            return

        # Now try and validate what is found in the data
        bare_pids = [
            p["project id"] for p in project_data if p["project id"] is not None
        ]
        safe_pids = [
            p["safe project id"]
            for p in project_data
            if p["safe project id"] is not None
        ]

        if bare_pids and safe_pids:
            LOGGER.error(
                "Both 'project id' and 'safe project id' provided: "
                "use only 'project id'."
            )
            proj_ids = bare_pids + safe_pids
        elif safe_pids:
            LOGGER.warning(
                "Use 'project id' rather than the legacy 'safe project id' key."
            )
            proj_ids = safe_pids
        elif bare_pids:
            proj_ids = bare_pids

        # Check any provided values are valid
        invalid_proj_ids = [p for p in proj_ids if p not in self.projects]
        valid_proj_ids = [p for p in proj_ids if p in self.projects]
        if invalid_proj_ids:
            LOGGER.error(
                "Unknown project ids provided: ", extra={"join": invalid_proj_ids}
            )

        self.project_id = valid_proj_ids
        LOGGER.info("Valid project ids provided: ", extra={"join": valid_proj_ids})

access: dict instance-attribute

A dictionary giving access metadata.

authors: list[dict] instance-attribute

A list of dictionaries of author metadata.

data_worksheets: list[Worksheet] = [] instance-attribute

A list of dictionaries of data tables in the Dataset.

description: str instance-attribute

A string giving a description of the dataset.

external_files: list[dict] | None = None instance-attribute

A list of dictionaries of external file metadata.

fields: dict[str, SummaryBlock] = dict(
    core=SummaryBlock(fields=[SummaryField('title', True, None, str), SummaryField('description', True, None, str)], mandatory=True, title='Core fields', singular=True),
    project_ids=SummaryBlock(fields=[SummaryField('safe project id', False, None, int), SummaryField('project id', False, None, int)], mandatory=True if self.projects else False, title='Project IDs', singular=False),
    access=SummaryBlock(fields=[SummaryField('access status', True, 'access', str), SummaryField('embargo date', False, 'embargo_date', datetime.datetime), SummaryField('access conditions', False, 'access_conditions', str)], mandatory=True, title='Access details', singular=True),
    keywords=SummaryBlock(fields=[SummaryField('keywords', True, None, str)], mandatory=True, title='Keywords', singular=False),
    doi=SummaryBlock(fields=[SummaryField('publication doi', True, None, str)], mandatory=False, title='DOI', singular=False),
    date=SummaryBlock(fields=[SummaryField('start date', True, None, datetime.datetime), SummaryField('end date', True, None, datetime.datetime)], mandatory=False, title='Date Extents', singular=True),
    geo=SummaryBlock(fields=[SummaryField('west', True, None, float), SummaryField('east', True, None, float), SummaryField('south', True, None, float), SummaryField('north', True, None, float)], mandatory=False, title='Geographic Extents', singular=True),
    authors=SummaryBlock(fields=[SummaryField('author name', True, 'name', str), SummaryField('author affiliation', False, 'affiliation', str), SummaryField('author email', False, 'email', str), SummaryField('author orcid', False, 'orcid', str)], mandatory=True, title='Authors', singular=False),
    funding=SummaryBlock(fields=[SummaryField('funding body', True, 'body', str), SummaryField('funding type', True, 'type', str), SummaryField('funding reference', False, 'ref', (str, int, float)), SummaryField('funding link', False, 'url', str)], mandatory=False, title='Funding Bodies', singular=False),
    external=SummaryBlock(fields=[SummaryField('external file', True, 'file', str), SummaryField('external file description', True, 'description', str)], mandatory=False, title='External Files', singular=False),
    worksheet=SummaryBlock(fields=[SummaryField('worksheet name', True, 'name', str), SummaryField('worksheet title', True, 'title', str), SummaryField('worksheet description', True, 'description', str), SummaryField('worksheet external file', False, 'external', str)], mandatory=False, title='Worksheets', singular=False),
    permits=SummaryBlock(fields=[SummaryField('permit type', True, 'type', str), SummaryField('permit authority', True, 'authority', str), SummaryField('permit number', True, 'number', (str, int, float))], mandatory=False, title='Permits', singular=False),
) instance-attribute

A dictionary setting the summary blocks that can be present.
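
As an illustration (not from the library documentation), the block definitions can be inspected programmatically, for example to recover the mandatory field keys that _validate_keys checks for:

summary = Summary(resources)
mandatory_keys = [
    fld.key
    for block in summary.fields.values()
    if block.mandatory
    for fld in block.fields
    if fld.mandatory
]
# ['title', 'description', 'access status', 'keywords', 'author name']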

funders = None instance-attribute

A list of dictionaries of funder metadata.

keywords: list[str] instance-attribute

A list of keyword strings.

latitudinal_extent: Extent = Extent('latitudinal extent', (float, int), hard_bounds=resources.extents.latitudinal_hard_extent, soft_bounds=resources.extents.latitudinal_soft_extent) instance-attribute

Extent instance for the latitudinal extent of the Dataset.

load(worksheet, sheetnames, validate_doi=False)

Populate a Summary instance from an Excel Worksheet.

Parameters:

    worksheet (Worksheet, required): An openpyxl worksheet instance.
    sheetnames (set, required): A set of sheet names found in the workbook.
    validate_doi (bool, default False): Should publication DOIs be validated
        (needs web connection).
Source code in safedata_validator/summary.py
@loggerinfo_push_pop("Checking Summary worksheet")
def load(
    self,
    worksheet: Worksheet,
    sheetnames: set,
    validate_doi: bool = False,
) -> None:
    """Populate a Summary instance from an Excel Worksheet.

    Args:
        worksheet: An openpyxl worksheet instance.
        sheetnames: A set of sheet names found in the workbook.
        validate_doi: Should publication DOIs be validated (needs web connection).
    """
    handler = get_handler()
    start_errors = handler.counters["ERROR"]

    self.validate_doi = validate_doi

    rows = load_rows_from_worksheet(worksheet)

    self._ncols = worksheet.max_column

    # convert into dictionary using the lower cased first entry as the key after
    # checking for empty values (None) and non-string values.
    row_headers = IsString([r[0] for r in rows])
    if not row_headers:
        # Check for None separately because seeing 'None' as a field key in the
        # report is very confusing for end users.
        if None in row_headers.failed:
            LOGGER.error("Summary metadata fields column contains empty cells")

        # Get other non-string headers.
        non_none_failed_headers = [
            hdr for hdr in row_headers.failed if hdr is not None
        ]
        if non_none_failed_headers:
            LOGGER.error(
                "Summary metadata fields column contains non text values: ",
                extra={"join": non_none_failed_headers},
            )

    # Now we have warned about bad values, explicitly cast row keys to string to
    # continue processing.
    self._rows = {str(rw[0]).lower(): rw[1:] for rw in rows}

    # Check if metadata keys have white space padding
    self._check_for_whitespace()

    # Validate the keys found in the summary table
    self._validate_keys()

    # Now process the field blocks
    self._load_core()
    self._load_project_ids()
    self._load_access_details()
    self._load_authors()
    self._load_keywords()
    self._load_doi()
    self._load_temporal_extent()
    self._load_geographic_extent()
    self._load_funders()
    self._load_permits()
    self._load_external_files()
    self._load_data_worksheets(sheetnames)

    # summary of processing
    self.n_errors = handler.counters["ERROR"] - start_errors
    if self.n_errors > 0:
        LOGGER.info(f"Summary contains {self.n_errors} errors")
    else:
        LOGGER.info("Summary formatted correctly")

longitudinal_extent: Extent = Extent('longitudinal extent', (float, int), hard_bounds=resources.extents.longitudinal_hard_extent, soft_bounds=resources.extents.longitudinal_soft_extent) instance-attribute

Extent instance for the longitudinal extent of the Dataset.

n_errors: int = 0 instance-attribute

The number of validation errors found in the summary.

permits: list[dict] instance-attribute

A list of dictionaries of research permit metadata.

project_id: list[int] | None = None instance-attribute

A list of project ID codes, if project IDs are configured.

projects: dict[int, str] = resources.projects instance-attribute

A dictionary of valid project data.

publication_doi = None instance-attribute

A list of DOIs associated with the dataset.

temporal_extent: Extent = Extent('temporal extent', (datetime.date,), hard_bounds=resources.extents.temporal_hard_extent, soft_bounds=resources.extents.temporal_soft_extent) instance-attribute

Extent instance for the temporal extent of the Dataset.

title: str instance-attribute

A string giving the dataset title.

validate_doi = False instance-attribute

A boolean flag indicating whether DOI values should be validated.

load_rows_from_worksheet(worksheet)

Load worksheet rows, removing blank rows.

Parameters:

    worksheet (Worksheet, required): An openpyxl worksheet instance.
Source code in safedata_validator/summary.py
def load_rows_from_worksheet(worksheet: Worksheet) -> list[tuple]:
    """Load worksheet rows, removing blank rows.

    Args:
        worksheet: An openpyxl worksheet instance.
    """
    # TODO - make 'internal' blank rows an error.
    rows = []
    for this_row in worksheet.iter_rows(values_only=True):
        if not all([blank_value(vl) for vl in this_row]):
            rows.append(this_row)

    return rows
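
A brief usage sketch (illustrative, assuming the workbook from the usage sketch above): completely blank rows are dropped, and each remaining row is returned as a tuple of cell values with the row label in the first position.

rows = load_rows_from_worksheet(workbook["Summary"])
# e.g. [('Title', 'My example dataset'), ('Description', 'A short description')]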