
The resources submodule

The safedata_validator package needs access to some local resources and configuration to work. The core resources for file validation are:

  • gazetteer: A path to a GeoJSON formatted gazetteer of known locations and their details.

  • location_aliases: A path to a CSV file containing known aliases of the location names provided in the gazetteer.

  • gbif_database: The path to a local SQLite copy of the GBIF backbone database.

  • ncbi_database: The path to a local SQLite copy of the NCBI database files.

  • project_database: Optionally, a path to a CSV file providing valid project IDs.

The Resources class is used to locate and validate these resources, and then provide those validated resources to other components of the package.

A configuration file can be passed as config when creating an instance, but if no arguments are provided then an attempt is made to find and load configuration files in the user and then site config locations defined by the appdirs package. See the appdirs package documentation for details.
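For example, a minimal safedata_validator.cfg might look like the following. The paths are placeholders and only a subset of the optional sections is shown; the full set of keys is given by the CONFIGSPEC specification documented further down this page.

```ini
gazetteer = /path/to/gazetteer.geojson
location_aliases = /path/to/location_aliases.csv
gbif_database = /path/to/gbif_backbone.sqlite3
ncbi_database = /path/to/ncbi_taxonomy.sqlite3
project_database = /path/to/project_database.csv

[extents]
latitudinal_hard_extent = -90, 90
longitudinal_hard_extent = -180, 180

[zenodo]
community_name = safe
```

An instance can then be created with Resources() to search the user and site config locations, or with Resources(config='/path/to/safedata_validator.cfg') to use a specific file.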

Resources

Load and check validation resources.

Creating an instance of this class locates and validates resources for using the safedata_validator package, either from the provided configuration details or from the user and then site config locations defined by the appdirs package.

Parameters:

  • config (str | list | dict | None, default None): A path to a configuration file, or a dict or list providing package configuration details. The list format should provide a list of strings, each representing a line in the configuration file. The dict format is a dictionary with the required nested dictionary structure and values.

Attributes:

  • config_type: The method used to specify the resources. One of 'init file', 'init list', 'init dict', 'user file' or 'site file'.

  • gaz_path: The path to the gazetteer file.

  • localias_path: The path to the location aliases file.

  • gbif_database: The path to the GBIF database file.

  • ncbi_database: The path to the NCBI database file.

  • project_database: An optional path to a database of valid project IDs.

  • valid_locations: The locations defined in the gazetteer file, keyed by location name.

  • location_aliases (dict[str, str]): Location aliases defined in the location aliases file, keyed by alias.

  • extents: A DotMap of extent data.

  • zenodo: A DotMap of Zenodo information.

Source code in safedata_validator/resources.py
@loggerinfo_push_pop("Configuring Resources")
class Resources:
    """Load and check validation resources.

    Creating an instance of this class locates and validates resources for using the
    `safedata_validator` package, either from the provided configuration details or from
    the user and then site config locations defined by the appdirs package.

    Args:
        config:
            A path to a configuration file, or a dict or list providing package
            configuration details. The list format should provide a list of strings,
            each representing a line in the configuration file. The dict format is a
            dictionary with the required nested dictionary structure and values.

    Attributes:
        config_type: The method used to specify the resources. One of
            'init file', 'init list', 'init dict', 'user file' or 'site file'.
        gaz_path: The path to the gazetteer file
        localias_path: The path to the location aliases file
        gbif_database: The path to the GBIF database file
        ncbi_database: The path to the NCBI database file
        project_database: An optional path to a database of valid project IDs
        valid_locations: The locations defined in the gazetteer file
        location_aliases: Location aliases defined in the location aliases file
        extents: A DotMap of extent data
        zenodo: A DotMap of Zenodo information
    """

    def __init__(self, config: str | list | dict | None = None) -> None:
        # User and site config paths
        user_cfg_file = os.path.join(
            appdirs.user_config_dir(), "safedata_validator", "safedata_validator.cfg"
        )
        site_cfg_file = os.path.join(
            appdirs.site_config_dir(), "safedata_validator", "safedata_validator.cfg"
        )

        # First try and populate from a config file.
        if config is not None:
            if isinstance(config, str):
                if os.path.exists(config) and os.path.isfile(config):
                    config_type = "init file"
                else:
                    log_and_raise(f"Config file path not found: {config}", RuntimeError)
                    return
            elif isinstance(config, list):
                config_type = "init list"
            elif isinstance(config, dict):
                config_type = "init dict"
        elif os.path.exists(user_cfg_file) and os.path.isfile(user_cfg_file):
            config = user_cfg_file
            config_type = "user file"
        elif os.path.exists(site_cfg_file) and os.path.isfile(site_cfg_file):
            config = site_cfg_file
            config_type = "site file"
        else:
            LOGGER.critical(f"No user config in {user_cfg_file}")
            LOGGER.critical(f"No site config in {site_cfg_file}")
            log_and_raise("No config files provided or found", RuntimeError)
            return

        # Report resource config location and type
        msg = f"Configuring resources from {config_type}"
        if "file" in config_type:
            msg += f": {config}"
        LOGGER.info(msg)

        # Try and load the found configuration
        config_loaded = self._load_config(config, config_type)

        # Set attributes -
        # HACK - this now seems clumsy - the ConfigObj instance is already a
        #        class containing the config attributes. Having a _function_
        #        that returns a modified ConfigObj instance seems more direct
        #        than having to patch this list of attributes.
        self.gaz_path = config_loaded.gazetteer
        self.localias_path = config_loaded.location_aliases
        self.gbif_database = config_loaded.gbif_database
        self.ncbi_database = config_loaded.ncbi_database
        self.project_database = (
            None
            if config_loaded.project_database == ""
            else config_loaded.project_database
        )
        self.config_type = config_loaded.config_type
        self.config_source = config_loaded.config_source

        self.extents = config_loaded.extents
        self.zenodo = config_loaded.zenodo
        self.metadata = config_loaded.metadata
        self.xml = config_loaded.xml

        self.gbif_timestamp: str | None = None
        self.ncbi_timestamp: str | None = None

        # Valid locations is a dictionary keying string location names to tuples of
        # floats describing the location bounding box
        self.valid_locations: dict[str, list[float]] = dict()
        # Location aliases is a dictionary keying a string to a key in valid locations
        self.location_aliases: dict[str, str] = dict()
        # Projects are a dictionary keying project ID to a title.
        self.projects: dict[int, str] = dict()

        # Validate the resources
        self._validate_gazetteer()
        self._validate_location_aliases()
        self._validate_gbif()
        self._validate_ncbi()
        self._validate_projects()

    @staticmethod
    def _load_config(config: str | list | dict, cfg_type: str) -> DotMap:
        """Load a configuration file.

        This private static method attempts to load and validate the configuration
        details provided as a file path, list or dict.

        Args:
            config: Passed from Resources.__init__()
            cfg_type: Identifies the route used to provide the configuration details

        Raises:
            RuntimeError: If the file does not exist, or has issues.

        Returns:
            Returns a DotMap of config parameters.
        """

        # Try and use the configuration, raising if there is a problem: don't
        # skip over broken resource configurations.
        # - First, create a validator instance that handles lists of dates
        cf_validator = Validator({"date_list": date_list})

        # - Now load the config input and then apply the basic validation - are
        #   the values of the right type, right count etc.
        config_obj = ConfigObj(config, configspec=CONFIGSPEC)
        valid = config_obj.validate(cf_validator, preserve_errors=True)

        # If there are config file issues, then bail out.
        if isinstance(valid, dict):
            LOGGER.critical("Configuration issues: ")
            FORMATTER.push()
            for sec, key, err in flatten_errors(config_obj, valid):
                sec.append(key)
                LOGGER.critical(f"In config '{'.'.join(sec)}': {err}")
            FORMATTER.pop()
            raise RuntimeError("Configuration failure")

        # convert to a DotMap for ease
        config_obj = DotMap(config_obj)

        return config_obj

    def _validate_gazetteer(self) -> None:
        """Validate and load a gazetteer file.

        This private function checks whether a gazetteer path: exists, is a JSON file,
        and contains location GeoJSON data. It populates the instance attributes
        """

        if self.gaz_path is None or self.gaz_path == "":
            log_and_raise("Gazetteer file missing in configuration", RuntimeError)

        LOGGER.info(f"Validating gazetteer: {self.gaz_path}")

        # Now check to see whether the locations file behaves as expected
        if not os.path.exists(self.gaz_path) or not os.path.isfile(self.gaz_path):
            log_and_raise("Gazetteer file not found", OSError)

        try:
            loc_payload = simplejson.load(open(self.gaz_path))
        except (JSONDecodeError, UnicodeDecodeError):
            log_and_raise("Gazetteer file not valid JSON", OSError)

        # Simple test for GeoJSON
        if loc_payload.get("type") != "FeatureCollection":
            log_and_raise(
                "Gazetteer data not a GeoJSON Feature Collection", RuntimeError
            )

        try:
            self.valid_locations = {
                ft["properties"]["location"]: shape(ft["geometry"]).bounds
                for ft in loc_payload["features"]
            }
        except KeyError:
            log_and_raise(
                "Missing or incomplete location properties for gazetteer features",
                RuntimeError,
            )

    def _validate_location_aliases(self) -> None:
        """Validate and load location aliases.

        This private function checks whether a location_aliases path: exists, is a CSV
        file, and contains location_alias data. It populates the instance attributes
        """

        if self.localias_path is None or self.localias_path == "":
            log_and_raise(
                "Location aliases file missing in configuration", RuntimeError
            )

        LOGGER.info(f"Validating location aliases: {self.localias_path}")

        # Now check to see whether the locations file behaves as expected
        try:
            dictr = DictReader(open(self.localias_path))
        except FileNotFoundError:
            log_and_raise("Location aliases file not found", FileNotFoundError)
        except IsADirectoryError:
            log_and_raise("Location aliases path is a directory", IsADirectoryError)

        # Simple test for structure - field names only parsed when called, and this can
        # throw errors with bad file formats.
        try:
            if not dictr.fieldnames:
                log_and_raise("Location aliases file is empty", ValueError)
            else:
                fieldnames = set(dictr.fieldnames)
        except (UnicodeDecodeError, csvError):
            log_and_raise(
                "Location aliases file not readable as a CSV file with valid headers",
                ValueError,
            )

        if fieldnames != {"zenodo_record_id", "location", "alias"}:
            log_and_raise(
                "Location aliases file not readable as a CSV file with valid headers",
                ValueError,
            )

        # TODO - zenodo_record_id not being used here.
        self.location_aliases = {la["alias"]: la["location"] for la in dictr}

    def _validate_gbif(self) -> None:
        """Validate the GBIF settings.

        This private function validates the provided sqlite3 database file and updates
        the instance with validated details.
        """

        self.gbif_timestamp = validate_taxon_db(
            self.gbif_database, "GBIF", ["backbone"]
        )

    def _validate_ncbi(self) -> None:
        """Validate the NCBI settings.

        This private function validates the provided sqlite3 database files and updates
        the instance with validated details.
        """

        self.ncbi_timestamp = validate_taxon_db(
            self.ncbi_database, "NCBI", ["nodes", "merge", "names"]
        )

    def _validate_projects(self) -> None:
        """Validate and load a project database.

        This private function checks whether a project_database path: exists, is a CSV
        file, and contains project data. It populates the instance ``project_id``
        attribute.
        """

        if self.project_database is None:
            LOGGER.info("Configuration does not use project IDs.")
            return

        LOGGER.info(f"Validating project database: {self.project_database}")

        # Now check to see whether the project database behaves as expected
        try:
            dictr = DictReader(open(self.project_database, encoding="UTF-8"))
        except FileNotFoundError:
            log_and_raise("Project database file not found", FileNotFoundError)
        except IsADirectoryError:
            log_and_raise("Project database path is a directory", IsADirectoryError)

        # Simple test for structure - field names only parsed when called, and this can
        # throw errors with bad file formats.
        try:
            if not dictr.fieldnames:
                log_and_raise("Project database file is empty", ValueError)
            else:
                fieldnames = set(dictr.fieldnames)
        except (UnicodeDecodeError, csvError) as excep:
            LOGGER.critical(
                "Project database file not readable as a CSV file with valid headers"
            )
            raise excep

        required_names = {"project_id", "title"}
        if required_names.intersection(fieldnames) != required_names:
            log_and_raise(
                "Project database file does not contain project_id and title headers.",
                ValueError,
            )

        # Load the valid project ids
        try:
            self.projects = {int(prj["project_id"]): str(prj["title"]) for prj in dictr}
        except ValueError:
            log_and_raise(
                "Project database file values not integer IDs and text titles.",
                ValueError,
            )

CONFIGSPEC = {
    'gazetteer': 'string()',
    'location_aliases': 'string()',
    'gbif_database': 'string()',
    'ncbi_database': 'string()',
    'project_database': 'string(default=None)',
    'extents': {
        'temporal_soft_extent': 'date_list(min=2, max=2, default=None)',
        'temporal_hard_extent': 'date_list(min=2, max=2, default=None)',
        'latitudinal_hard_extent': 'float_list(min=2, max=2, default=list(-90, 90))',
        'latitudinal_soft_extent': 'float_list(min=2, max=2, default=None)',
        'longitudinal_hard_extent': 'float_list(min=2, max=2, default=list(-180, 180))',
        'longitudinal_soft_extent': 'float_list(min=2, max=2, default=None)',
    },
    'zenodo': {
        'community_name': 'string(default=safe)',
        'use_sandbox': 'boolean(default=None)',
        'zenodo_sandbox_api': 'string(default=None)',
        'zenodo_sandbox_token': 'string(default=None)',
        'zenodo_api': 'string(default=None)',
        'zenodo_token': 'string(default=None)',
        'contact_name': 'string(default=None)',
        'contact_affiliation': 'string(default=None)',
        'contact_orcid': 'string(default=None)',
    },
    'metadata': {
        'api': 'string(default=None)',
        'token': 'string(default=None)',
        'ssl_verify': 'boolean(default=True)',
    },
    'xml': {
        'languageCode': 'string(default=None)',
        'characterSet': 'string(default=None)',
        'contactCountry': 'string(default=None)',
        'contactEmail': 'string(default=None)',
        'epsgCode': 'integer(default=4326)',
        'projectURL': 'string(default=None)',
        'topicCategories': 'string_list(default=None)',
        'lineageStatement': 'string(default=None)',
    },
}

dict: The safedata_validator package uses the configobj.ConfigObj package to handle resource configuration. This dict defines the expected specification for the configuration and allows the ConfigObj.validate() method to perform basic validation and type conversion.
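As an illustration of the kind of check a configspec entry drives, the following stdlib-only sketch mimics the float_list(min=2, max=2) validation used for the extent settings. It is a simplified stand-in, not the configobj implementation: ConfigObj parses a comma-separated config value into a list of strings before a check function like this is applied.

```python
def float_list(value, min=None, max=None):
    """Simplified stand-in for the configobj 'float_list' check.

    Takes a list of strings (as ConfigObj parses comma-separated config
    values), enforces the entry-count bounds and converts each entry.
    """
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    if min is not None and len(value) < int(min):
        raise ValueError(f"too few entries: {len(value)} < {min}")
    if max is not None and len(value) > int(max):
        raise ValueError(f"too many entries: {len(value)} > {max}")
    return [float(entry) for entry in value]

# A hard latitudinal extent, matching the spec default list(-90, 90)
print(float_list(["-90", "90"], min=2, max=2))  # [-90.0, 90.0]
```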

date_list(value, min, max)

Validate config date lists.

A configobj.Validator extension function to check configuration values containing a list of ISO formatted date strings and to return parsed values.

Parameters:

  • value (str, required): A string containing comma-separated ISO date strings.

  • min (str, required): The minimum allowed number of entries.

  • max (str, required): The maximum allowed number of entries.
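The core of the parsing can be sketched with the standard library alone. This stand-in assumes ConfigObj has already split the comma-separated value into a list of strings, and uses date.fromisoformat rather than the more permissive dateutil isoparse used in the real function:

```python
from datetime import date

def parse_date_list(value: list[str]) -> list[date]:
    # Parse each ISO date string, stripping any surrounding whitespace.
    return [date.fromisoformat(entry.strip()) for entry in value]

# A temporal extent of two dates parses to date objects
print(parse_date_list(["2002-02-02", "2030-01-31"]))
# [datetime.date(2002, 2, 2), datetime.date(2030, 1, 31)]
```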
Source code in safedata_validator/resources.py
def date_list(value: str, min: str, max: str) -> list[date]:
    """Validate config date lists.

    A configobj.Validator extension function to check configuration values
    containing a list of ISO formatted date strings and to return parsed
    values.

    Args:
        value: A string containing comma-separated ISO date strings
        min: The minimum allowed number of entries
        max: The maximum allowed number of entries
    """
    # min and max are supplied as a string, test conversion to int
    try:
        min_int = int(min)
    except ValueError:
        raise VdtParamError("min", min)
    try:
        max_int = int(max)
    except ValueError:
        raise VdtParamError("max", max)

    # Check the supplied value is a list, triggering any issues
    # with list formatting
    value = is_list(value, min=min_int, max=max_int)

    # Next, check every member in the list is an ISO date string
    # noting that this strips out time information
    out = []
    for entry in value:
        try:
            parsed_entry = isoparse(entry).date()
        except ValueError:
            raise VdtValueError(entry)

        out.append(parsed_entry)

    # Return parsed values
    return out

validate_taxon_db(db_path, db_name, tables)

Validate a local taxon database file.

This helper function validates that a given path contains a valid taxonomy database:

  • the required tables are all present, automatically including the timestamp table.
  • the timestamp table contains a single ISO format date showing the database version.

Parameters:

  • db_path (str, required): Location of the SQLite3 database.

  • db_name (str, required): A label for the taxonomy database, used in logger messages.

  • tables (list[str], required): A list of table names expected to be present in the database.

Returns:

  • str: The database timestamp as an ISO formatted date string.
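The checks described above can be reproduced against a throwaway database. The snippet below builds a minimal SQLite file with a 'backbone' table and a one-row 'timestamp' table (the file name and timestamp value are illustrative) and then runs the same two queries the function uses:

```python
import os
import sqlite3
import tempfile
from datetime import date

# Create a minimal database containing the required tables
db_path = os.path.join(tempfile.mkdtemp(), "gbif_backbone.sqlite3")
with sqlite3.connect(db_path) as conn:
    conn.execute("CREATE TABLE backbone (id INTEGER)")
    conn.execute("CREATE TABLE timestamp (timestamp TEXT)")
    conn.execute("INSERT INTO timestamp VALUES ('2023-01-01')")

# The same checks that validate_taxon_db applies
with sqlite3.connect(db_path) as conn:
    found_tables = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type ='table';")
    }
    timestamp = conn.execute("select * from timestamp;").fetchall()

assert {"backbone", "timestamp"}.issubset(found_tables)  # required tables present
assert len(timestamp) == 1  # a single timestamp entry
date.fromisoformat(timestamp[0][0])  # and it parses as an ISO date
print(timestamp[0][0])  # 2023-01-01
```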

Source code in safedata_validator/resources.py
def validate_taxon_db(db_path: str, db_name: str, tables: list[str]) -> str:
    """Validate a local taxon database file.

    This helper function validates that a given path contains a valid taxonomy database:

    - the required tables are all present, automatically including the timestamp table.
    - the timestamp table contains a single ISO format date showing the database
      version.

    Args:
        db_path: Location of the SQLite3 database.
        db_name: A label for the taxonomy database - used in logger messages.
        tables: A list of table names expected to be present in the database.

    Returns:
        The database timestamp as an ISO formatted date string.
    """

    LOGGER.info(f"Validating {db_name} database: {db_path}")

    if db_path is None or db_path == "":
        log_and_raise(f"{db_name} database not set in configuration", ValueError)

    # Does the provided path exist and is it a functional SQLite database
    # with a backbone table? Because sqlite3 can 'connect' to any path,
    # use a query attempt to reveal exceptions

    if not os.path.exists(db_path):
        log_and_raise(f"{db_name} database not found", FileNotFoundError)

    # Connect to the file (which might or might not be a database containing the
    # required tables)
    with contextlib.closing(sqlite3.connect(db_path)) as conn:
        # Check that it is a database by running a query
        try:
            db_tables = conn.execute(
                "SELECT name FROM sqlite_master WHERE type ='table';"
            )
        except sqlite3.DatabaseError:
            log_and_raise(f"Local {db_name} database not an SQLite3 file", ValueError)

        # Check the required tables against found tables
        db_tables_set = {rw[0] for rw in db_tables.fetchall()}
        required_tables = set(tables + ["timestamp"])
        missing = required_tables.difference(db_tables_set)

        if missing:
            log_and_raise(
                f"Local {db_name} database does not contain required tables: ",
                ValueError,
                extra={"join": missing},
            )

        # Check the timestamp table contains a single ISO date
        cursor = conn.execute("select * from timestamp;")
        timestamp = cursor.fetchall()

    # Is there one unique date in the table
    if len(timestamp) != 1:
        log_and_raise(
            f"Local {db_name} database timestamp table does not contain a single "
            "entry.",
            RuntimeError,
        )

    try:
        # Extract first entry in first row
        timestamp_entry = timestamp[0][0]
        isoparse(timestamp_entry)
    except ValueError:
        log_and_raise(
            f"Local {db_name} database timestamp value is not an ISO date.",
            RuntimeError,
        )

    return timestamp_entry