pyqt-google-image-crawler

Crawl image files from Google search results with Python and icrawler

Requirements

PyQt5 >= 5.14 - for GUI support
icrawler - main package used for crawling
beautifulsoup4 - package required by icrawler

To run this, clone this repo and install all packages by running

pip install -r requirements.txt

and then run

python main.py

That's it. You should then see a result like the one below.

Explanation

As you can see, you can set parameters such as maximum length, color (including transparent), and language.

For those who don't understand the "color" item of the color parameter: "color" means "any color" here.

In the bottom portion of the window, you can add the topics you want to crawl images for to the list. Then press the run button, and it will keep crawling until the task is over!
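As a rough illustration of what the "color" item turns into: the log from an actual run, quoted further down, shows a request URL containing tbs=ic%3Acolor. The sketch below rebuilds a URL of that shape with the standard library. The function name build_search_url and its parameter names are hypothetical and are not part of icrawler or this repo.

```python
from urllib.parse import urlencode


def build_search_url(keyword, color_filter="ic:color", page=0, start=0):
    """Rebuild a Google image search URL of the shape seen in the log.

    Hypothetical helper for illustration only; icrawler builds its own URLs.
    """
    params = {
        "q": keyword,         # the search topic
        "ijn": page,          # result page index
        "start": start,       # offset of the first result
        "tbs": color_filter,  # "ic:color" is what the "color" item produced in the log
        "tbm": "isch",        # image search
    }
    return "https://www.google.com/search?" + urlencode(params)


print(build_search_url("10"))
```

With the keyword "10" this yields the same URL that appears in the log.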

You can run this as a background application too. Crawling is usually a very time-consuming job, so I decided to support that feature for those who want to get this out of the foreground.

By the way, I'm using icrawler in a very basic way. It's not good for collecting a massive number of images, but I'm sure this can give you an idea.

After all, Google Image search is one of the most accessible image stores on the Internet, even though icrawler has some flaws.
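To give an idea of what "a very basic way" might look like, here is a minimal sketch and not the code in main.py. It assumes that "maximum length" corresponds to icrawler's max_num, that the color setting is passed as the "color" filter, and that the installed icrawler version accepts a language argument in crawl(). The helper names build_crawl_kwargs and crawl_topics are hypothetical.

```python
from pathlib import Path


def build_crawl_kwargs(keyword, max_num=10, color="color", language=None):
    """Collect the arguments for GoogleImageCrawler.crawl() (hypothetical helper)."""
    kwargs = {
        "keyword": keyword,
        "max_num": max_num,            # assumed to match "maximum length"
        "filters": {"color": color},   # e.g. "color", "transparent"
    }
    if language:
        # Assumption: recent icrawler versions accept a language argument.
        kwargs["language"] = language
    return kwargs


def crawl_topics(topics, root_dir="images", **options):
    """Crawl each topic in turn, storing images in one folder per topic."""
    # Imported lazily so the helper above works without icrawler installed.
    from icrawler.builtin import GoogleImageCrawler

    for topic in topics:
        crawler = GoogleImageCrawler(
            storage={"root_dir": str(Path(root_dir) / topic)}
        )
        crawler.crawl(**build_crawl_kwargs(topic, **options))


# Example call (needs icrawler and network access):
# crawl_topics(["cat", "dog"], max_num=5, color="transparent")
```

Each topic is crawled one after another, which is roughly the behavior of the topic list described above.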

TypeError: 'NoneType' object is not iterable

Unfortunately the code doesn't work for me.
When I run the program I get this message:

">>
started
2023-09-03 20:31:45,500 - INFO - icrawler.crawler - start crawling...
2023-09-03 20:31:45,509 - INFO - icrawler.crawler - starting 1 feeder threads...
2023-09-03 20:31:45,517 - INFO - feeder - thread feeder-001 exit
2023-09-03 20:31:45,517 - INFO - icrawler.crawler - starting 2 parser threads...
2023-09-03 20:31:45,531 - INFO - icrawler.crawler - starting 4 downloader threads...
C:\Users\user\AppData\Roaming\Python\Python311\site-packages\urllib3\connectionpool.py:1095: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.google.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings
warnings.warn(
C:\Users\user\AppData\Roaming\Python\Python311\site-packages\urllib3\connectionpool.py:1095: InsecureRequestWarning: Unverified HTTPS request is being made to host 'consent.google.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings
warnings.warn(
2023-09-03 20:31:46,425 - INFO - parser - parsing result page https://www.google.com/search?q=10&ijn=0&start=0&tbs=ic%3Acolor&tbm=isch
Exception in thread parser-001:
Traceback (most recent call last):
File "C:\Program Files\Python311\Lib\threading.py", line 1038, in _bootstrap_inner
self.run()
File "C:\Program Files\Python311\Lib\threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\icrawler\parser.py", line 94, in worker_exec
for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
2023-09-03 20:31:47,540 - INFO - parser - no more page urls for thread parser-002 to parse
2023-09-03 20:31:47,540 - INFO - parser - thread parser-002 exit
2023-09-03 20:31:50,536 - INFO - downloader - no more download task for thread downloader-003
2023-09-03 20:31:50,537 - INFO - downloader - no more download task for thread downloader-004
2023-09-03 20:31:50,538 - INFO - downloader - no more download task for thread downloader-001
2023-09-03 20:31:50,539 - INFO - downloader - no more download task for thread downloader-002
2023-09-03 20:31:50,539 - INFO - downloader - thread downloader-003 exit
2023-09-03 20:31:50,540 - INFO - downloader - thread downloader-004 exit
2023-09-03 20:31:50,541 - INFO - downloader - thread downloader-001 exit
2023-09-03 20:31:50,542 - INFO - downloader - thread downloader-002 exit
2023-09-03 20:31:51,542 - INFO - icrawler.crawler - Crawling task done!
finished"

Can someone help?
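A note on reading the traceback: the last frame, "for task in self.parse(response, **kwargs)", fails because parse() returned None instead of something iterable. The log also shows a request to consent.google.com, which suggests (this is an assumption, not something confirmed here) that Google served a consent page with no image data to parse. The sketch below reproduces the same error with a hypothetical stand-in parser, not icrawler's own code, and shows a defensive guard.

```python
def parse(html):
    """Hypothetical stand-in parser; NOT icrawler's implementation."""
    if "<img" not in html:
        return None  # nothing found, e.g. on a consent page
    return [{"file_url": "https://example.com/a.jpg"}]


def collect_tasks(html):
    """Iterate safely: `or []` turns a None result into an empty list."""
    return [task for task in parse(html) or []]


try:
    for task in parse("<html>consent page</html>"):
        pass
except TypeError as exc:
    print(exc)  # same kind of error as in the log above

print(collect_tasks("<html>consent page</html>"))
```

The guard only silences the crash; if the page really is a consent page, no images will be downloaded until the crawler gets a normal result page.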
