Writing a fetcher
Install and configure environment
create a virtualenv
install dependencies
create source-data and json-data directories
Good practices
follow directives of robots.txt
write in a dynamic manner to ensure resilience
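One way to follow the directives of robots.txt is to check each URL before requesting it. The sketch below uses the standard library module urllib.robotparser. The user agent name, the sample rules and the example.org URLs are assumptions made for illustration; they do not come from any real provider.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent name; use the one your fetcher really sends.
USER_AGENT = "my-fetcher"

# Sample rules used for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text content of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_robots_parser(SAMPLE_ROBOTS_TXT)
# Allowed: no rule matches this path.
print(parser.can_fetch(USER_AGENT, "https://example.org/data/file.csv"))
# Disallowed: the path starts with /private/.
print(parser.can_fetch(USER_AGENT, "https://example.org/private/file.csv"))
```

Against a live site, the parser can load the rules itself with set_url() followed by read() instead of parse().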
Main steps of a script
start from the skeleton of download.py or convert.py based on available examples
define what is a resource
implement the function prepare_resources
implement the function process_resource
A resource must have a unique id. Other attributes (e.g. file, url) can be defined by inheriting from the base class dbnomics_fetcher_toolbox.resources.Resource.
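The steps above can be sketched as follows. To stay self-contained, this example does not import the toolbox: the Resource class here is a stand-in dataclass, and FileResource, the dataset names and the example.org URLs are hypothetical. The real signatures of prepare_resources and process_resource in dbnomics_fetcher_toolbox may differ, so treat this as an outline of the idea only.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class Resource:
    """Stand-in for dbnomics_fetcher_toolbox.resources.Resource (assumption)."""

    id: str


@dataclass
class FileResource(Resource):
    """Hypothetical resource with two extra attributes."""

    url: str
    file: Path


def prepare_resources(target_dir: Path) -> Iterator[FileResource]:
    """Declare the resources to process, each with a unique id."""
    for name in ("dataset-a", "dataset-b"):  # hypothetical dataset names
        yield FileResource(
            id=name,
            url=f"https://example.org/{name}.csv",
            file=target_dir / f"{name}.csv",
        )


def process_resource(resource: FileResource) -> None:
    """Process one resource.

    A real fetcher would download resource.url here; this placeholder
    only writes the URL to the target file so the sketch runs offline.
    """
    resource.file.write_text(resource.url, encoding="utf-8")
```

Keeping the declaration of resources separate from their processing makes it possible to handle each resource independently of the others.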
Using a sub-directory per resource:
implement the create_context method, which creates the directory
implement the delete method, which deletes the directory
import shutil
from pathlib import Path
from dbnomics_fetcher_toolbox.resources import Resource

class DaresResource(Resource):
    dir: Path

    def create_context(self):
        self.dir.mkdir(exist_ok=True)

    def delete(self):
        """Delete HTML file and all Excel files."""
        shutil.rmtree(self.dir)
Run the fetcher
python download.py source-data
You should find status.jsonl in source-data and several files corresponding to the downloaded resources.
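The .jsonl extension indicates the JSON Lines format, with one JSON document per line. A minimal sketch for inspecting such a file is shown below. The helper name iter_jsonl is hypothetical, and no assumption is made about which fields each record contains.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON document per non-empty line of the file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    status_file = Path("source-data") / "status.jsonl"
    # Only print records when the fetcher has already been run.
    if status_file.exists():
        for record in iter_jsonl(status_file):
            print(record)
```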
To display the available script options:
python download.py source-data --help
Pin dependency versions
Your fetcher may use external Python packages that are installed in your virtualenv with pip install.
It is highly recommended to pin the version number of those dependencies in requirements.txt. Listing only the package names in requirements.txt is not sufficient: anyone installing them later will get whichever versions are current at that time, and those versions may behave differently from the ones you worked with. It is also important to pin versions recursively, that is, to pin the dependencies of your dependencies as well.
There are several solutions in the Python community to achieve version pinning. DBnomics fetchers use pip-tools like this:
# requirements.in
python-slugify
ujson
pip install pip-tools
pip-compile
The following file is generated:
# requirements.txt
#
# This file is autogenerated by pip-compile
# To update, run:
#
# pip-compile
#
python-slugify==4.0.0 # via -r requirements.in
text-unidecode==1.3 # via python-slugify
ujson==2.0.3 # via -r requirements.in
Both requirements.in and requirements.txt must be committed.
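As a quick sanity check, a small script can verify that every entry of a requirements file is pinned with ==. This is only a sketch: the helper name find_unpinned is hypothetical, and the parsing is deliberately simplistic (it ignores comments and option lines such as -r, and would not handle URLs containing a # fragment).

```python
from typing import List


def find_unpinned(requirements_text: str) -> List[str]:
    """Return the requirement lines that do not pin an exact version."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments such as "# via python-slugify".
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


REQUIREMENTS_IN = """\
python-slugify
ujson
"""

REQUIREMENTS_TXT = """\
python-slugify==4.0.0 # via -r requirements.in
text-unidecode==1.3 # via python-slugify
ujson==2.0.3 # via -r requirements.in
"""

print(find_unpinned(REQUIREMENTS_IN))   # ['python-slugify', 'ujson']
print(find_unpinned(REQUIREMENTS_TXT))  # []
```

The first call reports both entries of requirements.in as unpinned, which is expected for that file, while the generated requirements.txt yields an empty list.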
Technical details
why use asyncio?