Goodtables

Goodtables is a continuous validation service for tabular data. Developed internally by the SCC Lab, it is modelled on the service provided by the Open Knowledge Foundation at https://goodtables.io.

The service lets users continuously validate the format and correctness of every CSV or TSV file uploaded to the S3 data lake (backed by Minio). Once users register their bucket with the service, Goodtables automatically processes each file uploaded to S3 and runs a series of validation checks, producing an analysis report that users can consult to see the results.

The idea is to reduce the effort needed to validate files, and thus improve the quality of the datasets produced and shared within the digital hub by end users. At first, the service will be offered as an opt-in option, connected only to the data lake, to help scientists and users improve the quality of their data.

The second step will be a direct connection with the data catalogue CKAN, where each tabular dataset inserted will have to pass a validation check by Goodtables before being published. This way, data will be continuously checked, reducing the likelihood of malformed or incomplete files being published.

Goodtables-py

The validation framework used by the service is goodtables-py (https://github.com/frictionlessdata/goodtables-py), a Python library and toolkit specifically designed to perform a series of validation checks on tabular data, at both the structural and the content level.

The main features of the library (from its website) are:

  • Structural checks: Ensure that there are no empty rows, no blank headers, etc.
  • Content checks: Ensure that the values have the correct types (“string”, “number”, “date”, etc.), that their format is valid (“string must be an e-mail”), and that they respect the constraints (“age must be a number greater than 18”).
  • Support for multiple tabular formats: CSV, Excel files, LibreOffice, Data Package, etc.
  • Parallelized validations for multi-table datasets
  • Command line interface
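To make the notion of a structural check concrete, the following is a minimal sketch written with the Python standard library only. It is not goodtables-py code and does not reproduce its behaviour; the function name `structural_errors` and the error labels are illustrative assumptions.

```python
import csv
import io


def structural_errors(text, delimiter=","):
    """Return (label, row_number, message) tuples for basic structural problems.

    Illustrative only: labels and messages are assumptions, not goodtables-py output.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [("no-rows", 0, "file is empty")]

    errors = []
    headers = rows[0]
    seen = set()
    # Header checks: every column needs a non-blank, unique name.
    for position, header in enumerate(headers, start=1):
        name = header.strip()
        if not name:
            errors.append(("blank-header", 1, f"header in column {position} is blank"))
        elif name in seen:
            errors.append(("duplicate-header", 1, f"header '{name}' is repeated"))
        seen.add(name)

    # Row checks: no empty rows, and every row as wide as the header.
    for number, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            errors.append(("blank-row", number, "row is empty"))
        elif len(row) != len(headers):
            errors.append(
                ("ragged-row", number, f"expected {len(headers)} cells, found {len(row)}")
            )
    return errors
```

Calling `structural_errors(text, delimiter="\t")` applies the same checks to a TSV file.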

The current implementation (as of 2019) supports only structural checks; support for content checks depends on the adoption of a content schema that describes the required fields and their types. While goodtables-py offers an open schema format, available at https://frictionlessdata.io/specs/, it has yet to be decided how to introduce this support in the platform.
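As an illustration of what a content schema could enable, the sketch below checks rows against a descriptor shaped like a Table Schema (a `fields` list with `name`, `type` and `constraints`). It is a simplified stand-in written for this page, not the goodtables-py implementation; `content_errors` and the `CASTS` table are hypothetical names, and only a few types and the `required` and `minimum` constraints are covered.

```python
from datetime import date

# Casting functions for a small subset of types (assumption: values arrive as strings).
CASTS = {
    "string": str,
    "integer": int,
    "number": float,
    "date": date.fromisoformat,
}


def content_errors(schema, rows):
    """Check dict rows against a Table Schema-like descriptor.

    Returns (row_number, field_name, message) tuples; an empty list means no errors.
    """
    errors = []
    for number, row in enumerate(rows, start=1):
        for field in schema["fields"]:
            name = field["name"]
            type_name = field.get("type", "string")
            constraints = field.get("constraints", {})
            raw = row.get(name)
            # Missing values only matter when the field is required.
            if raw is None or raw == "":
                if constraints.get("required"):
                    errors.append((number, name, "value is required"))
                continue
            try:
                value = CASTS[type_name](raw)
            except ValueError:
                errors.append((number, name, f"'{raw}' is not a valid {type_name}"))
                continue
            minimum = constraints.get("minimum")
            if minimum is not None and value < minimum:
                errors.append((number, name, f"{value} is below the minimum {minimum}"))
    return errors
```

A descriptor such as `{"fields": [{"name": "age", "type": "integer", "constraints": {"minimum": 18}}]}` would flag a row whose `age` is `"12"`. Note that `minimum` is inclusive, so it accepts 18 itself.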

Continuous validation service

The service is designed as a backend system, developed in Java with the Spring Boot framework, which provides an API and an execution service able to perform various kinds of checks when triggered.

While the inspiration comes from goodtables.io (https://goodtables.io/about), we found the implementation and the approach taken by that project too restrictive and narrowly focused for adoption within the Digital Hub.

The idea of offering a self-managed portal where users can register their resources and have them checked is much more useful when applied to a variety of formats and repositories, rather than being restricted to GitHub repositories and tabular data as in the original project. We envision a solution where users will be able to check both the structure and the content of resources such as tabular data, geographic and spatial datasets (GeoJSON, KML), plain JSON (with JSON Schema), XML (with XSD), etc.

To support this vision we built a custom system, where data processors and validators are decoupled from the management system and from the triggers.

The high-level architecture consists of:

  • a suite of validators, mostly built on open source projects with reputable history and recognition
  • a suite of triggers able to receive an event and start the validation process on a resource
  • a suite of connectors designed to collect resources (files) from repositories and also write results
  • a repository of registrations, each representing a user's request to check a kind of file inside a specific repository
  • an API for programmatic access
  • a user interface for self-management

The backend layer connects all the various parts and also leverages AAC to obtain roles and identities, and the Vault to collect single-use credentials for resource access.

Any kind of trigger can start the validation process, which can involve any of the available validators. The event generated by the trigger is parsed asynchronously, so that the trigger thread is freed immediately. The parsed event is then used to look up the corresponding resource definition. If one or more registrations match the event, the system verifies that the remote file is accessible by requesting an appropriate set of credentials, and then forks a dedicated process to execute the validation script defined in the registration. The executor runs the validation in an isolated thread and eventually outputs the results, which are stored in the primary database alongside the registrations and, if requested, also written in a suitable format to the remote storage next to the source file.

The whole system is designed to be easily extended by adding connectors and validators for new file types, and scaled by increasing the thread pool size for executors.
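The flow described above can be sketched roughly as follows. The actual service is written in Java with Spring Boot; this Python outline only illustrates the decoupling between triggers, registrations, connectors and executors. All names here (`Registration`, `ValidationService`, the `credentials`, `fetch` and `store` callables, and the event keys `repository` and `key`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable


@dataclass
class Registration:
    """Hypothetical registration: which files in which repository to check, and how."""
    repository: str
    pattern: str                        # e.g. "*.csv"
    validator: Callable[[bytes], dict]  # returns a report


class ValidationService:
    def __init__(self, registrations, credentials, fetch, store, workers=4):
        self.registrations = registrations
        self.credentials = credentials  # repository -> single-use credential
        self.fetch = fetch              # connector: (repository, key, credential) -> bytes
        self.store = store              # result sink: (repository, key, report) -> None
        # Raising max_workers is the scaling knob for executors.
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def on_event(self, event):
        """Called by a trigger: hand the event to the pool and return at once."""
        return self.pool.submit(self._handle, event)

    def _handle(self, event):
        repository, key = event["repository"], event["key"]
        reports = []
        for registration in self.registrations:
            matches = registration.repository == repository and fnmatch(key, registration.pattern)
            if not matches:
                continue
            credential = self.credentials(repository)
            data = self.fetch(repository, key, credential)
            report = registration.validator(data)
            self.store(repository, key, report)
            reports.append(report)
        return reports
```

In this sketch, `on_event` returns a future immediately, so the trigger thread is never blocked by fetching or validating. Supporting a new file type means supplying another `validator` callable, and a new repository another `fetch` connector.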

The triggers available as of the end of 2019 are:

  • MQTT: the service subscribes to a set of topics and listens for messages describing the upload of new files.
  • HTTP: the client uploads a file via a POST request.

The S3 (Minio) integration leverages the MQTT trigger.
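As a rough illustration of what the MQTT trigger has to do with each message, the sketch below extracts newly uploaded CSV/TSV objects from a payload. It assumes the message body follows the S3-style bucket notification layout (a `Records` list carrying `eventName`, `s3.bucket.name` and a URL-encoded `s3.object.key`). The topics and the exact payload used by the service are not described here, so the function name and this layout should be read as assumptions.

```python
import json
from urllib.parse import unquote_plus

TABULAR_SUFFIXES = (".csv", ".tsv")


def uploaded_tabular_files(payload):
    """Return (bucket, key) pairs for newly created CSV/TSV objects in one message.

    Assumes an S3-style notification payload; other event types are ignored.
    """
    message = json.loads(payload)
    found = []
    for record in message.get("Records", []):
        # Only object-creation events describe an upload.
        if not record.get("eventName", "").startswith("s3:ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys assumed URL-encoded
        if key.lower().endswith(TABULAR_SUFFIXES):
            found.append((bucket, key))
    return found
```

Each returned pair could then be matched against the registrations to decide whether a validation has to run.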

The validators available are:

  • Goodtables-py for CSV, TSV
  • JSON (without schema)
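Validating JSON without a schema amounts to a well-formedness check. A minimal sketch with the Python standard library is shown below; the report layout (`valid`, `line`, `column`, `message`) is a hypothetical format chosen for the example, not the one produced by the service.

```python
import json


def check_json(text):
    """Return a small report stating whether the text is well-formed JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        # Report where parsing failed so the user can locate the problem.
        return {
            "valid": False,
            "line": error.lineno,
            "column": error.colno,
            "message": error.msg,
        }
    return {"valid": True}
```

Without a schema, a document such as `{"age": "unknown"}` passes this check, since only the syntax is inspected and not the content.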

Validation API

Goodtables exposes a complete API through which users and developers will be able to:

  • register new repositories and request checks
  • review the validation results
  • perform validation checks on files via upload

The details of the API are not finalized.

User interface

The service offers a self-management portal for end users, which enables them to autonomously register and review validation tasks within any of the repositories available on Goodtables.

The authentication and authorization steps are performed via AAC.