Design goals

This section is a guide explaining how to write a fetcher for DBnomics, or maintain an existing one. Let’s dive into the different tasks that a fetcher has to do, the constraints it has to follow, and how it fits into the DBnomics architecture.

Be self-contained

A fetcher must be able to run independently of any infrastructure. To achieve this, fetchers simply write data to the file-system.

This allows anyone to run a fetcher without having to run the complete DBnomics infrastructure.
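As a minimal sketch of that constraint, a download script could take a target directory on the command line and write everything there, without depending on any other service (the names below are illustrative, not the toolbox API):

    import argparse
    from pathlib import Path

    def main() -> None:
        parser = argparse.ArgumentParser(description="Download provider data")
        parser.add_argument("target_dir", type=Path, help="directory receiving the downloaded files")
        args = parser.parse_args()
        args.target_dir.mkdir(parents=True, exist_ok=True)
        # ... download provider files into args.target_dir ...

    if __name__ == "__main__":
        main()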

Store provider data as-is

Fetchers download data from the provider’s infrastructure and write it to the file-system as-is.

Providers usually distribute data as:

  • static files (sometimes called bulk download): XML, JSON, CSV, XLSX, sometimes archived in ZIP files

  • web API, with responses being XML, JSON, etc.

File formats can be:

  • machine-readable: XML, JSON, CSV

  • human-readable: XLSX files using formatting, colors, etc.
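Whatever the distribution channel or format, the download step writes the provider bytes unchanged. A minimal sketch, assuming hypothetical URLs and file names:

    from pathlib import Path
    import urllib.request

    def download_as_is(url: str, target_file: Path) -> None:
        # Write the response bytes without any transformation.
        with urllib.request.urlopen(url) as response:
            target_file.write_bytes(response.read())

    download_as_is("https://example.org/bulk/dataset1.zip", Path("dataset1.zip"))
    download_as_is("https://example.org/api/dataset2?format=json", Path("dataset2.json"))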

Convert data to DBnomics data model

Fetchers convert downloaded data to a common data model. This allows the DBnomics platform to index data coming from all providers in a full-text search engine, and become an aggregator.

See also dbnomics-data-model
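The exact layout and field names are specified by dbnomics-data-model. As a simplified illustration only (the CSV columns and JSON fields below are assumptions, not the real schema), a convert step could look like this:

    import csv
    import json
    from pathlib import Path

    def convert_dataset(source_dir: Path, target_dir: Path) -> None:
        dataset_dir = target_dir / "dataset1"
        dataset_dir.mkdir(parents=True, exist_ok=True)
        observations = []
        with (source_dir / "dataset1.csv").open(newline="") as f:
            for row in csv.DictReader(f):
                # Column names are those of the hypothetical provider CSV.
                observations.append({"series_code": row["SERIES"], "period": row["PERIOD"], "value": row["VALUE"]})
        with (dataset_dir / "dataset.json").open("w") as f:
            json.dump({"code": "dataset1", "observations": observations}, f, indent=2)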

Keep past revisions

Most of the time, providers do not give access to the past revisions of data. However, it is often important to access them for reproducibility, for example to run computations that were written in the past with the data that was available at that time.

Fetchers rely on Git to handle revisions.
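For instance, each run can be recorded as one commit in the repository holding the converted data. The sketch below uses plain git commands; the repository path and commit message are illustrative:

    import subprocess
    from pathlib import Path

    def commit_revision(json_data_dir: Path, message: str) -> None:
        # One commit per fetcher run; past revisions stay in the Git history.
        subprocess.run(["git", "add", "--all"], cwd=json_data_dir, check=True)
        subprocess.run(["git", "commit", "--message", message], cwd=json_data_dir, check=True)

    commit_revision(Path("provider1-json-data"), "New revision of provider1 data")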

Avoid false revisions

Downloaded data sometimes differs slightly from one download to another, even if both downloads correspond to the same revision.

For example, an XML file can contain a prepared_at date, or an HTML file can contain a random URL to a CSS stylesheet, used to bypass the browser cache.

Keeping those differences would create false revisions, so fetchers are allowed to remove them from downloaded data.
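For example, a script could strip the volatile prepared_at attribute mentioned above before writing the XML file. This is only a sketch; the exact pattern to remove depends on the provider:

    import re
    from pathlib import Path

    def remove_prepared_at(xml_file: Path) -> None:
        text = xml_file.read_text(encoding="utf-8")
        # Drop the volatile attribute so that identical data produces identical files.
        text = re.sub(r'\s+prepared_at="[^"]*"', "", text)
        xml_file.write_text(text, encoding="utf-8")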

Ensure revision consistency

Errors can occur during the download or the conversion of data, and an error can leave a dataset partially written.

In that case, fetchers have to cancel writing the dataset, otherwise this would create a partial revision in which some data is updated while the rest comes from the previous revision.
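One way to achieve this, sketched below, is to write the dataset to a temporary directory and move it to its final place only once it has been fully written; write_dataset_files is a hypothetical function standing for the actual conversion code:

    import shutil
    from pathlib import Path

    def write_dataset_atomically(dataset_code: str, target_dir: Path) -> None:
        tmp_dir = target_dir / f"{dataset_code}.incomplete"
        tmp_dir.mkdir(parents=True, exist_ok=True)
        try:
            write_dataset_files(dataset_code, tmp_dir)  # hypothetical: may raise on error
        except Exception:
            shutil.rmtree(tmp_dir)  # cancel: do not leave a partial dataset behind
            raise
        final_dir = target_dir / dataset_code
        if final_dir.exists():
            shutil.rmtree(final_dir)
        tmp_dir.rename(final_dir)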

Process resources

Providers distribute data in various ways.

For example, here are some possible cases:

  • a CSV file defining a whole dataset (1 to 1 relationship)

  • an XLSX file defining many datasets (1 to many relationship)

  • many XML files defining a dataset (many to 1 relationship)

  • many files defining many datasets (many to many relationship)

In order to reason more easily about those different data granularities, the fetcher toolbox introduces the notion of resource.

In the previous examples, a resource can be a single file or a group of files.

Fetcher authors can choose the scope of resources, based on their understanding of provider data.
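As a purely hypothetical sketch (not the actual toolbox API), resources covering those different granularities could be modelled like this:

    from dataclasses import dataclass, field

    @dataclass
    class FileResource:
        """One provider file, corresponding to one or several datasets."""
        id: str
        url: str

    @dataclass
    class FileGroupResource:
        """Several provider files that together define a single dataset."""
        id: str
        urls: list = field(default_factory=list)

    resources = [
        FileResource(id="dataset1", url="https://example.org/dataset1.csv"),
        FileGroupResource(id="dataset2", urls=[
            "https://example.org/dataset2-part1.xml",
            "https://example.org/dataset2-part2.xml",
        ]),
    ]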

Error handling

Errors may occur during the processing of resources.

In that case, the error should not break the entire script execution: it should be logged, then the next resource should be processed. The script should not fail immediately by raising an exception.

Data generated by a script is written to the target directory. In case of error, data is kept but could be corrupt or incomplete. In development, this allows the fetcher author to inspect the situation. In production, that corrupt data should be removed.

For example, a download script may fail to download a resource because the server is down or slow, or a convert script may fail to convert a resource because its data is different from what was expected.

The fetcher toolbox takes care of handling the error (logging it) and moves on to the next resource.

This default behavior can be modified by using script options like --fail-fast, which makes the script fail by raising an exception.

The --delete-on-error option deletes the data of a resource whose processing failed. It requires implementing the Resource.delete method.
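The following hypothetical sketch illustrates this default behaviour and the effect of those options; the toolbox implements this logic for you, and the function and parameter names here are illustrative:

    import logging

    logger = logging.getLogger(__name__)

    def process_resources(resources, process, fail_fast=False, delete_on_error=False):
        for resource in resources:
            try:
                process(resource)
            except Exception:
                logger.exception("Error processing resource %r", resource.id)
                if delete_on_error:
                    resource.delete()  # only possible if Resource.delete is implemented
                if fail_fast:
                    raise  # stop immediately instead of moving on to the next resource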

Have a report on process execution

During the processing of resources, a script creates events describing the resulting state for each resource.

The fetcher toolbox provides functions for reading and writing such events.

By default, a status.jsonl JSON Lines file is written to the target directory.

Each event contains the start datetime, the resulting status (SUCCESS, FAILURE, SKIPPED), and other metadata.

If a status file exists when a script is run, previous events are loaded and already processed resources are not processed again.

This default behavior can be modified by using script options like --force or --retry-failed.
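As an illustration of this mechanism (the field names are assumptions, not the exact toolbox schema), the status file could be read and written like this:

    import json
    from datetime import datetime
    from pathlib import Path

    def load_successful_ids(status_file: Path) -> set:
        # Resources already processed successfully are skipped on the next run.
        if not status_file.exists():
            return set()
        events = [json.loads(line) for line in status_file.read_text().splitlines() if line]
        return {event["resource_id"] for event in events if event["status"] == "SUCCESS"}

    def append_event(status_file: Path, resource_id: str, status: str, started_at: datetime) -> None:
        event = {
            "resource_id": resource_id,
            "status": status,  # SUCCESS, FAILURE or SKIPPED
            "started_at": started_at.isoformat(),
        }
        with status_file.open("a") as f:
            f.write(json.dumps(event) + "\n")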

Scheduling

Fetchers are scheduled on a regular basis in order to keep DBnomics data up to date.