
License: MIT License

Language: Python (100%)
Topics: python nodejs scrapy puppeteer apify requests crawler

hands-on-webscraping's Introduction

Hands-on-WebScraping (NO LONGER MAINTAINED)

This repo is part of a blog series on several web scraping projects, where we explore scraping techniques for crawling data from simple websites all the way to websites using advanced protection.

hands-on-webscraping's People

Contributors

amitupreti


hands-on-webscraping's Issues

Issues installing libraries in python 3.8

Initially, I got an error saying dateutil does not have a valid version for my current install. Then I read that python-dateutil should be a substitute, but installing that also failed. Has anyone been able to get this working on 3.8?
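A minimal sketch for verifying the fix, assuming the maintained package name python-dateutil has been installed for the Python 3.8 interpreter (the import and version check below are assumptions, not something pinned by this repo):

    # Sanity check that python-dateutil imports and parses dates on Python 3.8
    import sys
    from dateutil import parser

    print(sys.version)
    print(parser.parse("2020-01-01T00:00:00"))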

unable to import get_links

    $ scrapy list
    Traceback (most recent call last):
      File "/home/iseadmin/anaconda3/bin/scrapy", line 10, in <module>
        sys.exit(execute())
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/cmdline.py", line 142, in execute
        cmd.crawler_process = CrawlerProcess(settings)
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 280, in __init__
        super(CrawlerProcess, self).__init__(settings)
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 152, in __init__
        self.spider_loader = self._get_spider_loader(settings)
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 146, in _get_spider_loader
        return loader_cls.from_settings(settings.frozencopy())
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 60, in from_settings
        return cls(settings)
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 24, in __init__
        self._load_all_spiders()
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 46, in _load_all_spiders
        for module in walk_modules(name):
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/utils/misc.py", line 77, in walk_modules
        submod = import_module(fullpath)
      File "/home/iseadmin/anaconda3/lib/python3.6/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 994, in _gcd_import
      File "<frozen importlib._bootstrap>", line 971, in _find_and_load
      File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 678, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/home/iseadmin/Desktop/nlp-project/Hands-on-WebScraping/project1_twitter_hashtag_crawler/TwitterHashTagCrawler/spiders/hashtag.py", line 8, in <module>
        from utils import get_links, get_hashtags, get_mentions
    ImportError: cannot import name 'get_links'
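This error usually means utils.py is not importable from the directory scrapy was launched in. A hedged sketch of one fix, assuming utils.py lives inside the TwitterHashTagCrawler package next to settings.py, is to make the spider import it through the package:

    # TwitterHashTagCrawler/spiders/hashtag.py (sketch; the package layout is an assumption)
    # A bare "from utils import ..." only resolves when the current working directory
    # happens to contain utils.py, so import through the project package instead.
    from TwitterHashTagCrawler.utils import get_links, get_hashtags, get_mentions

Alternatively, running scrapy list from the directory that contains utils.py sidesteps the problem without changing the code.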

crawling by time periods

Hi Amit, great crawler!!

well done :)

Is it possible to add the ability to crawl specific time periods? Right now it runs perfectly and crawls mostly 2020. Is there a way to add a time-based filter (for example, for 2019)?
I wasn't sure where to add it in the code. If you could guide me on where, that would be great!

Thanks,
Neta
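One place such a filter could go is in the spider's parse callback, dropping tweets whose timestamp falls outside the requested year. A minimal sketch, with the field name and timestamp format as assumptions rather than the crawler's actual ones:

    # Hedged sketch of a year filter applied to parsed tweet items.
    from datetime import datetime

    def keep_tweet(item, year=2019):
        # Return True only for tweets posted in the requested year.
        # "tweet_time" and its format are assumptions; adapt to the field the spider fills in.
        posted = datetime.strptime(item["tweet_time"], "%Y-%m-%dT%H:%M:%S")
        return posted.year == year

    # inside the spider's parse method, yield only matching items:
    #     if keep_tweet(item, year=2019):
    #         yield item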

scrapy list command doesn't work

All the requirements were installed successfully, but the scrapy list command didn't work; it fails with the error "'scrapy' is not recognized as an internal or external command, operable program or batch file."

    scrapy list
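That message means Windows cannot find the scrapy executable on PATH. A hedged workaround, assuming Scrapy is installed for the same interpreter you use to run Python, is to invoke it as a module through that interpreter:

    # Sketch: run "scrapy list" through the current interpreter so PATH does not matter.
    import subprocess
    import sys

    subprocess.run([sys.executable, "-m", "scrapy", "list"], check=True)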

HTTP Status Code Is Not Handled Or Not Allowed

Uh oh... did Twitter break us?
Do we have to change the user_agent in settings.py?

    2021-09-09 15:34:55 [scrapy.core.engine] INFO: Spider opened
    2021-09-09 15:34:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2021-09-09 15:34:55 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2021-09-09 15:34:55 [root] INFO: 3 hashtags found
    2021-09-09 15:34:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://mobile.twitter.com/hashtag/cats>: HTTP status code is not handled or not allowed
    2021-09-09 15:34:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://mobile.twitter.com/hashtag/dogs>: HTTP status code is not handled or not allowed
    2021-09-09 15:34:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://mobile.twitter.com/hashtag/hello>: HTTP status code is not handled or not allowed
    2021-09-09 15:34:55 [scrapy.core.engine] INFO: Closing spider (finished)
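A hedged sketch of the change being asked about: overriding the default user agent in settings.py. The exact string below is illustrative, and a 403 from Twitter may persist regardless if the mobile site now requires JavaScript or a logged-in session:

    # settings.py (sketch; values are illustrative, not the project's originals)
    # Present a browser-like user agent instead of Scrapy's default one.
    USER_AGENT = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0 Safari/537.36"
    )

    # Optionally let 403 responses reach spider callbacks for inspection/logging.
    HTTPERROR_ALLOWED_CODES = [403]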

No tweets are being scraped.

Hashtags are found, but it doesn't find any tweets. I have lowered the settings (delay and concurrency) and set ROBOTSTXT_OBEY to False. Any tips?
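For reference, a hedged sketch of the settings.py knobs described above (the values are illustrative, not the project's defaults):

    # settings.py (sketch of the throttling/robots settings mentioned above)
    ROBOTSTXT_OBEY = False        # do not drop requests because of robots.txt
    DOWNLOAD_DELAY = 2            # seconds to wait between requests to the same site
    CONCURRENT_REQUESTS = 1       # keep concurrency low to avoid being throttled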

Isn't Working?

This scraper had been working for me until today. Has anyone else had the same problem, or is it only happening to me?

Thank you very much

No longer scraping past the first page.

Hello my friend. I read about your tool on Medium and I must say it's very good; I've been in love with it. However, I came across a small problem which I just couldn't solve on my own. When scraping a single hashtag, it only scrapes the first 20 tweets, apparently because it's not able to fetch the next page. I'm using it on Windows 7, with Python properly set up along with all its dependencies. I suspect it might be due to some update on Twitter's end, but I'm not sure. Any help?
Thanks in advance.
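A hedged sketch of the pagination pattern a Scrapy spider for mobile.twitter.com typically used; if Twitter changed this markup, the selector is what would need updating. The start URL and CSS class below are assumptions, not the project's actual code:

    # Minimal spider sketch showing the "follow the next page" pattern.
    import scrapy

    class HashtagPaginationSketch(scrapy.Spider):
        name = "hashtag_pagination_sketch"
        start_urls = ["https://mobile.twitter.com/hashtag/cats"]

        def parse(self, response):
            # ... extract the ~20 tweets on the current page here ...
            # The selector is a guess at the old mobile-site markup and may no longer match.
            next_href = response.css("div.w-button-more a::attr(href)").get()
            if next_href:
                yield response.follow(next_href, callback=self.parse)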

Pulling All Tweets

Hey, quick question. When I ran this with the hashtag BigData, it pulled all tweets containing the words data or big data. Why is it not pulling only tweets with the hashtag BigData?
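If the search endpoint returns loosely matching tweets, one hedged workaround is to filter items on the scraper side before yielding them; the field name below is an assumption about the item structure:

    # Sketch: keep only items whose hashtag list actually contains the requested tag.
    def has_hashtag(item, tag="bigdata"):
        # True if the scraped item carries the exact hashtag (case-insensitive).
        hashtags = item.get("hashtags", [])
        return tag.lower() in (h.lstrip("#").lower() for h in hashtags)

    # inside the spider, yield only matching items:
    #     if has_hashtag(item, "bigdata"):
    #         yield item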
