API Reference

Scraper

This module carries out an efficient scrape of the WIT dataset using multiprocessing and minimal PNG chunk header handling.
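
To illustrate what "minimal PNG chunk header handling" can mean, the sketch below reads only the first 33 bytes of a PNG (the signature plus the IHDR chunk) to recover its dimensions and colour type without decoding any pixel data. It is not taken from wikitransp: the names read_ihdr and has_alpha_channel are hypothetical, and the idea that the scraper filters on the alpha channel is an assumption.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
IHDR_END = 33  # 8-byte signature + 4 length + 4 type + 13 data + 4 CRC


def read_ihdr(header: bytes) -> dict:
    """Parse the IHDR chunk from the first 33 bytes of a PNG file.

    Hypothetical helper for illustration, not part of wikitransp.
    """
    if len(header) < IHDR_END:
        raise ValueError("need at least 33 bytes to read the IHDR chunk")
    if header[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # Every chunk starts with a big-endian length and a 4-byte type.
    length, chunk_type = struct.unpack(">I4s", header[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR chunk")
    # The CRC covers the chunk type and data, but not the length field.
    (crc,) = struct.unpack(">I", header[29:33])
    if crc != zlib.crc32(header[12:29]):
        raise ValueError("IHDR checksum mismatch")
    width, height, bit_depth, colour_type = struct.unpack(">IIBB", header[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "colour_type": colour_type,
    }


def has_alpha_channel(info: dict) -> bool:
    """Colour types 4 (greyscale + alpha) and 6 (RGBA) carry an alpha channel."""
    return info["colour_type"] in (4, 6)
```

Because the IHDR chunk always comes first, a check like this only needs the opening bytes of each file, which is what makes header-only inspection cheap compared with a full download and decode.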


wikitransp.scraper.scrape_images(sample=False, resume_at=None, resume_after=None, decompress_tsv=False, fetch_async=True)

Build a local dataset by scanning the WIT dataset (or a small sample of it) for suitable PNGs. Note: pass only one of resume_at or resume_after, not both.

Parameters
  • sample – Whether to scrape only the 1% sample dataset.

  • resume_at – The image URL to resume at (if scraping was interrupted).

  • resume_after – The image URL to resume after (if scraping was interrupted).

  • decompress_tsv – Whether to decompress gzipped TSVs before filtering (not necessary, and will increase dataset file size on disk).
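
The difference between resume_at and resume_after can be sketched as follows: resume_at restarts with the named URL itself, while resume_after restarts with the URL that follows it. This is a minimal illustration under stated assumptions, not the library's code. The helper resume_urls and the example URLs are hypothetical, and raising ValueError when both arguments are given is an assumption about how the "only one" rule might be enforced.

```python
from itertools import dropwhile, islice
from typing import Iterable, Iterator, Optional


def resume_urls(
    urls: Iterable[str],
    resume_at: Optional[str] = None,
    resume_after: Optional[str] = None,
) -> Iterator[str]:
    """Yield the URLs that still need processing (hypothetical helper)."""
    if resume_at is not None and resume_after is not None:
        # Assumed behaviour: the two options are mutually exclusive.
        raise ValueError("pass only one of resume_at or resume_after")
    target = resume_at if resume_at is not None else resume_after
    if target is None:
        return iter(urls)
    # Skip everything before the target URL.
    remaining = dropwhile(lambda url: url != target, urls)
    if resume_after is not None:
        # The target itself was already handled, so skip it as well.
        remaining = islice(remaining, 1, None)
    return remaining


urls = ["https://example.org/a.png", "https://example.org/b.png", "https://example.org/c.png"]
at = list(resume_urls(urls, resume_at="https://example.org/b.png"))
after = list(resume_urls(urls, resume_after="https://example.org/b.png"))
```

Under these assumptions, resume_at suits a run that was interrupted before the named image finished, and resume_after suits a run where the named image is known to have completed.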

Package data

Provides data to the scraper.