Writing a fetcher¶
Install and configure environment¶
initialize a new project from dbnomics-fetcher-cookiecutter
create a virtualenv
follow directives of robots.txt
write in a dynamic manner to ensure resilience
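The robots.txt directive above can be checked from Python with the standard library. The following is a minimal sketch, not part of the fetcher skeleton: the robots.txt content, the `my-fetcher` user agent and the example.org URLs are all assumptions made for illustration. A real fetcher would load the provider's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "my-fetcher") -> bool:
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
```

Calling `is_allowed(parser, url)` before each download, and sleeping for `parser.crawl_delay("my-fetcher")` seconds between requests when a delay is declared, is one way to respect the provider's rules.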
Main steps of a script¶
start from the skeleton of
define what is a resource
implement the function prepare_resources
implement the function process_resource
A resource must have a unique id. Other attributes (e.g. url) can be defined by inheriting the base class.
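As an illustration of the unique id rule, here is a minimal sketch. The `Resource` class below is a stand-in written for this example, not the real base class, and `ExcelResource`, `check_unique_ids`, the resource names and the URLs are hypothetical. The actual signature of `prepare_resources` in the skeleton may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    """Stand-in for the base class (hypothetical, for illustration only)."""

    id: str


@dataclass
class ExcelResource(Resource):
    """Hypothetical subclass adding a url attribute by inheritance."""

    url: str


def prepare_resources() -> List[ExcelResource]:
    """Return the resources to download; names and URLs are made up."""
    names = ["employment", "wages"]
    return [
        ExcelResource(id=name, url=f"https://example.org/{name}.xlsx")
        for name in names
    ]


def check_unique_ids(resources: List[Resource]) -> None:
    """Raise ValueError if two resources share the same id."""
    ids = [resource.id for resource in resources]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"Duplicate resource ids: {duplicates}")


resources = prepare_resources()
check_unique_ids(resources)
```

Deriving the id from something stable on the provider side (a file name or a dataset code) keeps ids unique and reproducible between runs.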
Using a sub-directory per resource:
create_context method, which creates the directory
delete method, which deletes the directory
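    import shutil
    from pathlib import Path

    # Resource is the base class that resources inherit from (import not shown).
    class DaresResource(Resource):
        dir: Path

        def create_context(self):
            # Create the resource sub-directory; no error if it already exists.
            self.dir.mkdir(exist_ok=True)

        def delete(self):
            """Delete HTML file and all Excel files."""
            shutil.rmtree(self.dir)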
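How these two methods are called during a run is not described here, so the following is only a sketch under assumptions: `DirResource` is a self-contained stand-in rather than the real class, and the `process_resource` signature, the `data.bin` file name and the cleanup-on-failure policy are hypothetical.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DirResource:
    """Stand-in resource owning one sub-directory (hypothetical class)."""

    id: str
    dir: Path

    def create_context(self) -> None:
        # Create the sub-directory; no error if it already exists.
        self.dir.mkdir(exist_ok=True)

    def delete(self) -> None:
        # Remove the sub-directory and everything inside it.
        shutil.rmtree(self.dir)


def process_resource(resource: DirResource, content: bytes) -> Path:
    """Write content into the resource directory; clean up if writing fails."""
    resource.create_context()
    target = resource.dir / "data.bin"
    try:
        target.write_bytes(content)
    except OSError:
        # Do not leave a partially written resource behind.
        resource.delete()
        raise
    return target


# Use a temporary directory in place of source-data for the demonstration.
target_dir = Path(tempfile.mkdtemp())
resource = DirResource(id="employment", dir=target_dir / "employment")
written = process_resource(resource, b"example")
```

Keeping everything a resource produces inside its own directory means a failed or outdated resource can be removed with a single `delete()` call.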
Run the fetcher¶
python download.py source-data
You should find the source-data directory, containing several files corresponding to the downloaded resources.
To display the available script options:
python download.py source-data --help
Pin dependencies versions¶
Your fetcher may use external Python packages that are installed in your virtualenv with
It is highly recommended to pin the version number of those dependencies in requirements.txt. Listing only the package names in requirements.txt is not sufficient: anyone installing them later gets whatever versions are available at that time, and those packages may behave differently from the ones you worked with. It is also important to pin versions recursively, that is, to pin the dependencies of your dependencies as well.
There are several solutions in the Python community to achieve version pinning. DBnomics fetchers use pip-tools like this:
    # requirements.in
    python-slugify
    ujson
    pip install pip-tools
    pip-compile
The following file is generated:
    # requirements.txt
    #
    # This file is autogenerated by pip-compile
    # To update, run:
    #
    #    pip-compile
    #
    python-slugify==4.0.0     # via -r requirements.in
    text-unidecode==1.3       # via python-slugify
    ujson==2.0.3              # via -r requirements.in
requirements.txt must be committed.
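To catch an unpinned dependency before committing, a small check can be run on the file. This is a simplified sketch, not a pip-tools feature: the `unpinned_requirements` helper is hypothetical, and it only recognizes plain `name==version` lines, ignoring comments and option lines such as `-r`.

```python
import re
from typing import List

# A pinned line looks like "name==version", optionally with extras: "name[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==\S+")


def unpinned_requirements(text: str) -> List[str]:
    """Return the requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw_line in text.splitlines():
        # Drop trailing comments such as "# via python-slugify".
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options (e.g. "-r other.txt").
        if not line or line.startswith("-"):
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Sample content with one deliberately unpinned package.
SAMPLE = """\
python-slugify==4.0.0     # via -r requirements.in
text-unidecode==1.3       # via python-slugify
ujson
"""

print(unpinned_requirements(SAMPLE))
```

On a file generated by pip-compile the returned list should be empty, since every package, including indirect dependencies such as text-unidecode, is written with an exact version.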
why use asyncio?