
License: MIT License

Language: Python (100%)
Topics: python nodejs scrapy puppeteer apify requests crawler

hands-on-webscraping's Introduction

Hands-on-WebScraping (NO LONGER MAINTAINED)

This repo is part of a blog series on several web scraping projects, where we explore scraping techniques for crawling data from simple websites all the way to websites using advanced protection.

hands-on-webscraping's People

Contributors

amitupreti


hands-on-webscraping's Issues

Issues installing libraries in python 3.8

Initially, I got an error saying dateutil does not have a valid version for my current install. Then I read that python-dateutil should be a substitute, but installing that also failed. Has anyone been able to get this working on 3.8?
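A minimal sketch for verifying the fix, assuming the maintained package name python-dateutil has been installed for the Python 3.8 interpreter (the import and version check below are assumptions, not something pinned by this repo):

    # Sanity check that python-dateutil imports and parses dates on Python 3.8
    import sys
    from dateutil import parser

    print(sys.version)
    print(parser.parse("2020-01-01T00:00:00"))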

unable to import get_links

    $ scrapy list
    Traceback (most recent call last):
      File "/home/iseadmin/anaconda3/bin/scrapy", line 10, in <module>
        sys.exit(execute())
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/cmdline.py", line 142, in execute
        cmd.crawler_process = CrawlerProcess(settings)
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 280, in __init__
        super(CrawlerProcess, self).__init__(settings)
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 152, in __init__
        self.spider_loader = self._get_spider_loader(settings)
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/crawler.py", line 146, in _get_spider_loader
        return loader_cls.from_settings(settings.frozencopy())
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 60, in from_settings
        return cls(settings)
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 24, in __init__
        self._load_all_spiders()
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/spiderloader.py", line 46, in _load_all_spiders
        for module in walk_modules(name):
      File "/home/iseadmin/anaconda3/lib/python3.6/site-packages/scrapy/utils/misc.py", line 77, in walk_modules
        submod = import_module(fullpath)
      File "/home/iseadmin/anaconda3/lib/python3.6/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 994, in _gcd_import
      File "<frozen importlib._bootstrap>", line 971, in _find_and_load
      File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 678, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/home/iseadmin/Desktop/nlp-project/Hands-on-WebScraping/project1_twitter_hashtag_crawler/TwitterHashTagCrawler/spiders/hashtag.py", line 8, in <module>
        from utils import get_links, get_hashtags, get_mentions
    ImportError: cannot import name 'get_links'
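This error usually means utils.py is not importable from the directory scrapy was launched in. A hedged sketch of one fix, assuming utils.py lives inside the TwitterHashTagCrawler package next to settings.py, is to make the spider import it through the package:

    # TwitterHashTagCrawler/spiders/hashtag.py (sketch; the package layout is an assumption)
    # A bare "from utils import ..." only resolves when the current working directory
    # happens to contain utils.py, so import through the project package instead.
    from TwitterHashTagCrawler.utils import get_links, get_hashtags, get_mentions

Alternatively, running scrapy list from the directory that contains utils.py sidesteps the problem without changing the code.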

crawling by time periods

Hi Amit, great crawler!!

well done :)

Is it possible to add the ability to crawl specific time periods? Right now it runs perfectly and crawls mostly 2020. Is there a way to add a time-based filter (for example, for 2019)?
I wasn't sure where to add it in the code. If you could guide me on where, that would be great!

Thanks,
Neta
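One place such a filter could go is in the spider's parse callback, dropping tweets whose timestamp falls outside the requested year. A minimal sketch, with the field name and timestamp format as assumptions rather than the crawler's actual ones:

    # Hedged sketch of a year filter applied to parsed tweet items.
    from datetime import datetime

    def keep_tweet(item, year=2019):
        # Return True only for tweets posted in the requested year.
        # "tweet_time" and its format are assumptions; adapt to the field the spider fills in.
        posted = datetime.strptime(item["tweet_time"], "%Y-%m-%dT%H:%M:%S")
        return posted.year == year

    # inside the spider's parse method, yield only matching items:
    #     if keep_tweet(item, year=2019):
    #         yield item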

scrapy list command doesn't work

All the requirements were installed successfully, but the scrapy list command didn't work; it fails with the error "'scrapy' is not recognized as an internal or external command, operable program or batch file."

    scrapy list
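That message means Windows cannot find the scrapy executable on PATH. A hedged workaround, assuming Scrapy is installed for the same interpreter you use to run Python, is to invoke it as a module through that interpreter:

    # Sketch: run "scrapy list" through the current interpreter so PATH does not matter.
    import subprocess
    import sys

    subprocess.run([sys.executable, "-m", "scrapy", "list"], check=True)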

HTTP Status Code Is Not Handled Or Not Allowed

Uh oh... did Twitter break us?
Do we have to change the user_agent in settings.py?

    2021-09-09 15:34:55 [scrapy.core.engine] INFO: Spider opened
    2021-09-09 15:34:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2021-09-09 15:34:55 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2021-09-09 15:34:55 [root] INFO: 3 hashtags found
    2021-09-09 15:34:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://mobile.twitter.com/hashtag/cats>: HTTP status code is not handled or not allowed
    2021-09-09 15:34:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://mobile.twitter.com/hashtag/dogs>: HTTP status code is not handled or not allowed
    2021-09-09 15:34:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://mobile.twitter.com/hashtag/hello>: HTTP status code is not handled or not allowed
    2021-09-09 15:34:55 [scrapy.core.engine] INFO: Closing spider (finished)
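A hedged sketch of the change being asked about: overriding the default user agent in settings.py. The exact string below is illustrative, and a 403 from Twitter may persist regardless if the mobile site now requires JavaScript or a logged-in session:

    # settings.py (sketch; values are illustrative, not the project's originals)
    # Present a browser-like user agent instead of Scrapy's default one.
    USER_AGENT = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0 Safari/537.36"
    )

    # Optionally let 403 responses reach spider callbacks for inspection/logging.
    HTTPERROR_ALLOWED_CODES = [403]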

No tweets are being scraped.

Hashtags are found, but it doesn't find any tweets. I have lowered the settings (delay and concurrency) and set ROBOTSTXT_OBEY to False. Any tips?
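For reference, a hedged sketch of the settings.py knobs described above (the values are illustrative, not the project's defaults):

    # settings.py (sketch of the throttling/robots settings mentioned above)
    ROBOTSTXT_OBEY = False        # do not drop requests because of robots.txt
    DOWNLOAD_DELAY = 2            # seconds to wait between requests to the same site
    CONCURRENT_REQUESTS = 1       # keep concurrency low to avoid being throttled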

Isn't Working?

This scraper had been working for me until today. Has anyone else had the same problem, or is it only happening to me?

Thank you very much

No longer scraping past the first page.

Hello my friend. I read about your tool on Medium and I must say it's very good; I've been in love with it. However, I came across a small problem which I just couldn't solve on my own. When scraping a single hashtag, it only scrapes the first 20 tweets, apparently because it's not able to fetch the next page. I'm using it on Windows 7, with Python properly set up along with all its dependencies. I suspect it might be due to some update on Twitter's end, but I'm not sure. Any help?
Thanks in advance.
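A hedged sketch of the pagination pattern a Scrapy spider for mobile.twitter.com typically used; if Twitter changed this markup, the selector is what would need updating. The start URL and CSS class below are assumptions, not the project's actual code:

    # Minimal spider sketch showing the "follow the next page" pattern.
    import scrapy

    class HashtagPaginationSketch(scrapy.Spider):
        name = "hashtag_pagination_sketch"
        start_urls = ["https://mobile.twitter.com/hashtag/cats"]

        def parse(self, response):
            # ... extract the ~20 tweets on the current page here ...
            # The selector is a guess at the old mobile-site markup and may no longer match.
            next_href = response.css("div.w-button-more a::attr(href)").get()
            if next_href:
                yield response.follow(next_href, callback=self.parse)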

Pulling All Tweets

Hey, quick question. When I ran this with the hashtag BigData, it pulled all tweets containing the words data or big data. Why is it not pulling only tweets with the hashtag BigData?
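If the search endpoint returns loosely matching tweets, one hedged workaround is to filter items on the scraper side before yielding them; the field name below is an assumption about the item structure:

    # Sketch: keep only items whose hashtag list actually contains the requested tag.
    def has_hashtag(item, tag="bigdata"):
        # True if the scraped item carries the exact hashtag (case-insensitive).
        hashtags = item.get("hashtags", [])
        return tag.lower() in (h.lstrip("#").lower() for h in hashtags)

    # inside the spider, yield only matching items:
    #     if has_hashtag(item, "bigdata"):
    #         yield item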
