# Writing a fetcher

## Install and configure environment

* create a virtualenv
* install dependencies
* create `source-data` and `json-data` directories

## Good practices

* follow directives of robots.txt
* write in a dynamic manner to ensure resilience

## Main steps of a script

* start from the skeleton of `download.py` or `convert.py` based on [available examples](https://git.nomics.world/dbnomics/dbnomics-fetcher-toolbox/-/tree/master/examples)
* define what a resource is
* implement the function `prepare_resources`
* implement the function `process_resource`

A resource must have a unique `id`. Other attributes (e.g. `file`, `url`) can be defined by subclassing the base class `dbnomics_fetcher_toolbox.resources.Resource`.

Using a sub-directory per resource:

* implement the `create_context` method, which creates the directory
* implement the `delete` method, which deletes the directory

```python
import shutil
from pathlib import Path

from dbnomics_fetcher_toolbox.resources import Resource


class DaresResource(Resource):
    dir: Path

    def create_context(self):
        # Create the resource directory; do nothing if it already exists.
        self.dir.mkdir(exist_ok=True)

    def delete(self):
        """Delete HTML file and all Excel files."""
        # Remove the directory together with everything it contains.
        shutil.rmtree(self.dir)
```

## Run the fetcher

```bash
python download.py source-data
```

You should find `status.jsonl` in `source-data`, along with several files corresponding to the downloaded resources.

To display the available script options:

```bash
python download.py source-data --help
```

## Pin dependency versions

Your fetcher may use external Python packages that are installed in your virtualenv with `pip install`.

It is highly recommended to pin the version number of those dependencies in `requirements.txt`. Listing only the package names in `requirements.txt` is not sufficient: anyone installing them later will get whatever versions are available at that time, and those packages may behave differently from the ones you worked with.

It is also important to pin versions recursively, that is, including the dependencies of your dependencies.

There are several solutions in the Python community to achieve version pinning.
DBnomics fetchers use [pip-tools](https://github.com/jazzband/pip-tools) like this:

```
# requirements.in
python-slugify
ujson
```

```bash
pip install pip-tools
pip-compile
```

The following file is generated:

```
# requirements.txt
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile
#
python-slugify==4.0.0     # via -r requirements.in
text-unidecode==1.3       # via python-slugify
ujson==2.0.3              # via -r requirements.in
```

Both `requirements.in` and `requirements.txt` must be committed.

## Technical details

* why use asyncio?
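
The question above is left open here. One plausible motivation, stated as an assumption rather than as the toolbox authors' rationale, is that fetching is I/O-bound: while one download waits on the network, others can make progress. Below is a minimal sketch using only the standard library. The names `fake_download`, `process_all` and `run_demo` and the resource ids are hypothetical and not part of the toolbox API, and `asyncio.sleep` stands in for a real HTTP request.

```python
import asyncio
from typing import List


async def fake_download(resource_id: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network request (not toolbox API)."""
    # Awaiting yields control to the event loop, so other downloads can run.
    await asyncio.sleep(delay)
    return f"{resource_id}: done"


async def process_all(resource_ids: List[str], max_concurrency: int = 3) -> List[str]:
    """Process resources concurrently, capping the number of parallel requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(resource_id: str) -> str:
        # At most `max_concurrency` downloads hold the semaphore at once.
        async with semaphore:
            return await fake_download(resource_id)

    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(bounded(r) for r in resource_ids))


def run_demo() -> List[str]:
    """Run the sketch on three hypothetical resource ids."""
    return asyncio.run(process_all(["a", "b", "c"]))


if __name__ == "__main__":
    print(run_demo())
```

The semaphore keeps the number of simultaneous requests bounded, which matters when the data provider should not be flooded with requests.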