pmyteh / risjbot
A scrapy project to extract the text and metadata of articles from news websites
The site scraped by the ap crawler, http://bigstory.ap.org, has been taken down. There may be a suitable replacement at http://apnews.com, but the scraper will need rewriting.
Hi,
I've downloaded your project, but I don't know how to run it. Please help me.
Thanks in advance :)
Hello,
Does it have any way to crawl only articles published between one date and another?
2020-01-17 11:41:01 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.usatoday.com/story/sports/sports-betting/2020/01/17/anaheim-ducks-at-carolina-hurricanes-odds-picks-and-best-bets/41014307/> (referer: https://www.usatoday.com/news-sitemap.xml)
Traceback (most recent call last):
File "/home/euler/miniconda3/envs/pytorch/lib/python3.7/site-packages/scrapy/loader/processors.py", line 59, in __call__
value = func(value)
File "/home/euler/miniconda3/envs/pytorch/lib/python3.7/site-packages/scrapy/loader/processors.py", line 107, in __call__
return self.separator.join(values)
File "/mnt/data1/NLP/RISJbot-master/RISJbot/loaders.py", line 42, in _strip_strl
yield s.strip()
AttributeError: 'int' object has no attribute 'strip'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/euler/miniconda3/envs/pytorch/lib/python3.7/site-packages/scrapy/loader/__init__.py", line 159, in _process_input_value
return proc(value)
File "/home/euler/miniconda3/envs/pytorch/lib/python3.7/site-packages/scrapy/loader/processors.py", line 63, in __call__
(str(func), value, type(e).__name__, str(e)))
ValueError: Error in Compose with <scrapy.loader.processors.Join object at 0x7f4bf64d1dd0> value=<generator object _strip_strl at 0x7f4be104a8d0> error='AttributeError: 'int' object has no attribute 'strip''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/euler/miniconda3/envs/pytorch/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/mnt/data1/NLP/RISJbot-master/RISJbot/spiders/newssitemapspider.py", line 29, in parse
return self.parse_page(response)
File "/mnt/data1/NLP/RISJbot-master/RISJbot/spiders/us/usatoday.py", line 60, in parse_page
l.add_schemaorg(response)
File "/mnt/data1/NLP/RISJbot-master/RISJbot/loaders.py", line 171, in add_schemaorg
rdfa=False)
File "/mnt/data1/NLP/RISJbot-master/RISJbot/loaders.py", line 184, in add_schemaorg_mde
self.add_value('keywords', data.get('keywords'))
File "/home/euler/miniconda3/envs/pytorch/lib/python3.7/site-packages/scrapy/loader/__init__.py", line 78, in add_value
self._add_value(field_name, value)
File "/home/euler/miniconda3/envs/pytorch/lib/python3.7/site-packages/scrapy/loader/__init__.py", line 92, in _add_value
processed_value = self._process_input_value(field_name, value)
File "/home/euler/miniconda3/envs/pytorch/lib/python3.7/site-packages/scrapy/loader/__init__.py", line 164, in _process_input_value
value, type(e).__name__, str(e)))
ValueError: Error with input processor Compose: field='keywords' value=[2] error='ValueError: Error in Compose with <scrapy.loader.processors.Join object at 0x7f4bf64d1dd0> value=<generator object _strip_strl at 0x7f4be104a8d0> error='AttributeError: 'int' object has no attribute 'strip'''
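The traceback suggests that the schema.org metadata on this page supplies keywords as [2], a list containing an integer, so the s.strip() call in _strip_strl fails. Below is a minimal sketch of a more tolerant generator. It assumes that converting non-string values to text is acceptable for the keywords field, which the code shown here does not confirm. The name strip_strl_tolerant is hypothetical and is not part of RISJbot.

```python
def strip_strl_tolerant(values):
    """Yield stripped strings, converting non-string items instead of failing.

    Hypothetical replacement for a generator that calls s.strip() directly;
    the conversion policy is an assumption, not the project's own behaviour.
    """
    for s in values:
        if s is None:
            continue  # skip missing values rather than emitting "None"
        yield str(s).strip()


# Integers are converted to text, so a later join no longer raises.
print(", ".join(strip_strl_tolerant([2, " odds ", None])))
```

Whether an integer keyword should be kept or dropped is a judgment call. Dropping it would only need an isinstance(s, str) check in place of the str() conversion.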
from .aws_credentials import *
I ran the command scrapy crawl yahoo, but it gave me this error:
ModuleNotFoundError: No module named 'RISJbot.aws_credentials'
My system is Windows 10 with Python 3.7.
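The error indicates that no aws_credentials module exists inside the RISJbot package in the downloaded copy, presumably because it is meant to hold private credentials and is not distributed. One possible workaround is to make the import optional. The sketch below is an assumption about how that could be done, not the project's own code. The helper load_optional_settings is hypothetical, and copying only upper-case names follows the usual Scrapy settings convention.

```python
import importlib


def load_optional_settings(module_name, namespace):
    """Copy UPPER_CASE names from module_name into namespace if it exists.

    Returns True when the module was found and False when it is missing,
    so a missing credentials file no longer aborts start-up.
    """
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        return False
    for name in dir(module):
        if name.isupper():
            namespace[name] = getattr(module, name)
    return True


settings = {}
found = load_optional_settings("RISJbot.aws_credentials", settings)
print(found)
```

If the credentials are only needed for uploading results to AWS, creating an empty aws_credentials.py next to the file that contains the import may also be enough to get past this error. That too is an assumption about how the project uses the module.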
I keep getting a KeyError, "Spider not found: CNN", when I run scrapy crawl cnn, and the same happens for any other news website. What directory am I supposed to run that in? The README is very vague.