Created in collaboration with Suchit Jain
This code scrapes images and their captions from CommonCrawl web archives. An image is downloaded only if it is associated with a significant amount of English text. Images are also filtered by:
- Image resolution
- Caption length
- Image-caption matching
- Number of recognizable English words
- Presence of NSFW content
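To make the metadata-based filters above concrete, here is a minimal sketch covering resolution, caption length, and English word count. It is not the project's implementation: `FilterConfig`, its field names, the default thresholds, and the tiny `ENGLISH_WORDS` vocabulary are all hypothetical and are not taken from lib/config.py. Image-caption matching and NSFW detection are left out because they depend on trained models.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterConfig:
    # Hypothetical names and defaults, NOT the values in lib/config.py.
    min_width: int = 200
    min_height: int = 200
    min_caption_chars: int = 10
    max_caption_chars: int = 300
    min_english_words: int = 3


# Tiny stand-in vocabulary; a real run would load a full English word list.
ENGLISH_WORDS = frozenset(
    {"a", "the", "dog", "on", "beach", "running", "red", "car", "in", "of"}
)


def count_english_words(caption, vocabulary=ENGLISH_WORDS):
    """Count alphabetic tokens in the caption that appear in the vocabulary."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    return sum(1 for token in tokens if token in vocabulary)


def passes_filters(width, height, caption, cfg=FilterConfig()):
    """Return True if the image/caption pair survives all three checks."""
    # Resolution check: both dimensions must reach the minimum.
    if width < cfg.min_width or height < cfg.min_height:
        return False
    # Caption length check, measured in characters.
    if not (cfg.min_caption_chars <= len(caption) <= cfg.max_caption_chars):
        return False
    # Recognizable-English-word check.
    return count_english_words(caption) >= cfg.min_english_words
```

For example, `passes_filters(640, 480, "A dog running on the beach")` accepts the pair, while a caption such as `"IMG_0042.jpg"` is rejected because none of its tokens are in the vocabulary.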
The library depends on PyTorch, NumPy, warcio, BeautifulSoup4, NLTK, NudeNet, OpenCV, Pandas, and wget, plus gzip from the Python standard library.
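As a rough illustration of how image and caption pairs might be pulled from a page's HTML, the sketch below collects each `<img>` tag's `src` and `alt` attributes. Two things here are assumptions: that alt text serves as the caption, and the names `ImageAltCollector` and `extract_image_captions`, which are invented for this example. The sketch also swaps in the standard library's `html.parser` in place of BeautifulSoup4 so that it runs without extra installs.

```python
from html.parser import HTMLParser


class ImageAltCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        # Assumption: the alt attribute is used as the caption.
        alt = (attributes.get("alt") or "").strip()
        # Keep only images that have both a source and non-empty alt text.
        if src and alt:
            self.pairs.append((src, alt))


def extract_image_captions(html):
    """Return a list of (image URL, caption) tuples found in the HTML."""
    collector = ImageAltCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs
```

Images without a `src` or without alt text are skipped, which keeps uncaptioned images out of the later filtering stages.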
After installing the required libraries, run:
python main.py
To configure the download and filtering parameters, edit lib/config.py.