Writing a fetcher

Install and configure environment

  • initialize a new project from dbnomics-fetcher-cookiecutter

  • create a virtualenv

  • install dependencies

  • create source-data and json-data directories

Good practices

  • follow directives of robots.txt

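  • write the scripts in a dynamic manner (e.g. discover what to download rather than hard-coding it) so that they stay resilient to changes on the provider side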
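As an illustration of the first practice, here is a minimal sketch that checks a URL against robots.txt rules using the standard library module urllib.robotparser. The robots.txt content, the user agent name and the URLs are made up for the example; a real fetcher would load the provider's file with set_url() followed by read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so that the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "my-fetcher"  # hypothetical user agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if USER_AGENT may fetch the given URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(ROBOTS_TXT)
```

A download script could call is_allowed before requesting each URL and skip the resource when it returns False.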

Main steps of a script

  • start from the skeleton of download.py or convert.py

  • define what a resource is

  • implement the function prepare_resources

  • implement the function process_resource

A resource must have a unique id. Other attributes (e.g. file, url) can be defined by inheriting from the base class dbnomics_fetcher_toolbox.resources.Resource.
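The sketch below illustrates the idea of a resource with a unique id and extra attributes. To stay self-contained, it uses a stand-in Resource dataclass instead of the toolbox base class, so the real class may differ. The FileResource class, the dataset codes, the URLs and the check_unique_ids helper are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, List


@dataclass
class Resource:
    """Stand-in for dbnomics_fetcher_toolbox.resources.Resource (assumption)."""

    id: str


@dataclass
class FileResource(Resource):
    """Hypothetical resource: one remote file saved to one local path."""

    url: str
    file: Path


def prepare_resources(target_dir: Path) -> Iterator[FileResource]:
    """Yield one resource per dataset code (codes are made up for the example)."""
    for code in ["gdp", "cpi"]:
        yield FileResource(
            id=code,
            url=f"https://example.org/{code}.csv",
            file=target_dir / f"{code}.csv",
        )


def check_unique_ids(resources: List[Resource]) -> None:
    """Raise ValueError if two resources share the same id."""
    seen = set()
    for resource in resources:
        if resource.id in seen:
            raise ValueError(f"Duplicate resource id: {resource.id!r}")
        seen.add(resource.id)


resources = list(prepare_resources(Path("source-data")))
check_unique_ids(resources)
```

Deriving the id from a stable identifier of the provider (here the dataset code) keeps ids unique and reproducible from one run to the next.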

Using a sub-directory per resource:

  • implement the create_context method, which creates the directory

  • implement the delete method, which deletes the directory

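import shutil
from pathlib import Path

from dbnomics_fetcher_toolbox.resources import Resource


class DaresResource(Resource):
    dir: Path

    def create_context(self):
        # Create the directory dedicated to this resource.
        self.dir.mkdir(exist_ok=True)

    def delete(self):
        """Delete HTML file and all Excel files."""
        shutil.rmtree(self.dir)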

Run the fetcher

python download.py source-data

You should find status.jsonl in source-data and several files corresponding to the downloaded resources.
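The .jsonl extension suggests the JSON Lines format, with one JSON object per line. Assuming that, here is a minimal sketch for inspecting such a file after a run. The "status" field name is an assumption, as the actual schema written by the toolbox may differ.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, List


def parse_jsonl(text: str) -> List[Dict]:
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_jsonl(path: Path) -> List[Dict]:
    """Read and parse a JSON Lines file."""
    return parse_jsonl(path.read_text(encoding="utf-8"))


def count_by_status(events: List[Dict]) -> Counter:
    """Count events by their "status" field (hypothetical field name)."""
    return Counter(event.get("status", "UNKNOWN") for event in events)
```

For example, count_by_status(read_jsonl(Path("source-data") / "status.jsonl")) would give a quick summary of a run, provided the events carry such a field.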

To display the available script options:

python download.py source-data --help

Pin dependency versions

Your fetcher may use external Python packages that are installed in your virtualenv with pip install.

It is highly recommended to pin the version number of those dependencies in requirements.txt. Listing only the package names is not sufficient: anyone installing them later would get whatever versions are available at that time, and those versions may behave differently from the ones you worked with. It is also important to pin versions recursively, i.e. to pin the dependencies of your dependencies as well.

There are several solutions in the Python community to achieve version pinning. DBnomics fetchers use pip-tools like this:

# requirements.in
python-slugify
ujson

pip install pip-tools
pip-compile

The following file is generated:

# requirements.txt
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile
#
python-slugify==4.0.0     # via -r requirements.in
text-unidecode==1.3       # via python-slugify
ujson==2.0.3              # via -r requirements.in

Both requirements.in and requirements.txt must be committed.
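As a complementary check, the following sketch lists the lines of a requirements.txt that are not pinned with "==". It is a simplified, hypothetical helper and not part of pip-tools: it ignores comments, blank lines and option lines such as "-r other.txt", and it does not handle URLs containing "#".

```python
from typing import List


def unpinned_requirements(requirements_txt: str) -> List[str]:
    """Return the requirement lines that do not pin an exact version."""
    unpinned = []
    for raw_line in requirements_txt.splitlines():
        # Drop trailing comments such as "# via python-slugify".
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options (e.g. "-r requirements.in").
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned
```

Such a check could run in continuous integration to catch a requirements.txt that was edited by hand instead of being regenerated with pip-compile.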

Technical details

  • why use asyncio?