Comments (11)
parse() was just a suggestion for a quick test. The right approach would probably be to use the engine_started signal.
from scrapy.
Please provide a complete minimal reproducible example.
from scrapy.
Here is a minimal reproducible example:
from abc import ABCMeta
from typing import Optional

import scrapy
from motor.core import AgnosticCollection, AgnosticDatabase
from motor.motor_asyncio import (
    AsyncIOMotorClient as MotorClient,
)
from pymongo.errors import OperationFailure
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings


class QccSpiderBase(scrapy.Spider, metaclass=ABCMeta):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.settings: Settings = get_project_settings()
        # The client is created here, before the crawler's reactor is running.
        self.client: MotorClient = MotorClient(self.settings['MONGO_URI'])
        self.db_name = self.settings['DB_NAME']
        self.db: Optional[AgnosticDatabase] = self.client[self.db_name]
        # collection_name is expected to be defined by the concrete subclass.
        self.collection: Optional[AgnosticCollection] = self.db[self.collection_name]

    async def parse(self, response, **kwargs):
        # compName is not defined in the snippet as posted.
        async with await self.client.start_session() as s:
            count = await self.collection.count_documents({"compName": compName}, limit=1, session=s)
from scrapy.
Is async with await correct? Shouldn't it be async with?
from scrapy.
Regarding the actual issue, does it work if you move the client creation to the parse method?
from scrapy.
It did not raise an error before I updated my libraries, including scrapy and motor, to the latest versions. I don't know whether this comes from a change in twisted.
It looks like the error happens because the motor client created a new asyncio loop instead of using the loop used by the crawler's reactor. If I don't specify the io_loop parameter when initializing the client, it creates a new loop the first time it looks up the loop.
I've fixed it by moving the motor client initialization into parse() and setting its event loop to the current event loop used by scrapy's twisted reactor.
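import sys
from asyncio import get_event_loop  # assumed source of get_event_loop

# Both functions below are methods of the spider class.
def setup_mongo_client(self):
    # Bind the client to the loop that is current inside the running crawler.
    self.client = MotorClient(self.settings['MONGO_URI'], io_loop=get_event_loop())
    self.db = self.client[self.db_name]
    self.collection = self.db[self.collection_name]
    # Debug output: print the three loops so they can be compared.
    print(f"self.client.get_io_loop(): {self.client.io_loop}")
    print(f"get_event_loop(): {get_event_loop()}")
    reactor = sys.modules["twisted.internet.reactor"]
    print(f"twisted.internet.reactor: {reactor._asyncioEventloop}")

def parse(self, response, **kwargs):
    self.setup_mongo_client()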
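The loop mismatch described above can be reproduced with the standard library alone. The sketch below uses a hypothetical LoopBoundClient class as a stand-in for a client that pins an event loop at construction time; it is not Motor's code and only illustrates why a client created too early fails while one created inside the running loop works.

```python
import asyncio


class LoopBoundClient:
    """Hypothetical stand-in for a client that pins a loop when constructed."""

    def __init__(self, io_loop=None):
        # Without io_loop, fall back to a brand-new loop of our own.
        self.io_loop = io_loop if io_loop is not None else asyncio.new_event_loop()

    def ping(self):
        # The returned future belongs to self.io_loop.
        future = self.io_loop.create_future()
        self.io_loop.call_soon(future.set_result, "pong")
        return future


async def use(client):
    return await client.ping()


async def create_late_and_use():
    # Created while the loop is running, so the client shares that loop.
    client = LoopBoundClient(io_loop=asyncio.get_running_loop())
    return await client.ping()


def run_demo():
    results = {}
    early_client = LoopBoundClient()  # created before any loop is running
    runner_loop = asyncio.new_event_loop()  # plays the role of the reactor's loop
    try:
        try:
            runner_loop.run_until_complete(use(early_client))
        except RuntimeError as exc:
            # The future is attached to a different loop than the running task.
            results["early"] = type(exc).__name__
        results["late"] = runner_loop.run_until_complete(create_late_and_use())
    finally:
        runner_loop.close()
        early_client.io_loop.close()
    return results
```

Calling run_demo() shows the early client failing with a RuntimeError and the late client returning "pong".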
from scrapy.
Regarding the actual issue, does it work if you move the client creation to the parse method?
Yes, that solves it, but I used to initialize it in __init__ and it always worked fine.
from scrapy.
The motor client created a new asyncio loop instead of using the loop used by the crawler's reactor. If I don't specify the io_loop parameter when initializing the client, it creates a new loop the first time it looks up the loop.
Makes sense.
from scrapy.
Let me correct that: by default, the motor client uses motor.frameworks.asyncio.get_event_loop() to get the currently running loop, or to create a new loop when none exists. If I put the initialization in __init__, the reactor hasn't created a loop yet, so the client creates a new one.
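The "running loop, else a new loop" rule described above can be sketched with plain asyncio. The resolve_loop helper below is a simplified, hypothetical stand-in for that behaviour, not Motor's actual implementation; it only shows why the call site (before or inside the running loop) decides which loop is picked.

```python
import asyncio


def resolve_loop():
    """Return the running loop, or create a fresh one if none is running."""
    try:
        return asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running yet, as in __init__ before the crawler starts.
        return asyncio.new_event_loop()


async def resolve_inside_loop():
    return resolve_loop()


def compare_loops():
    early = resolve_loop()  # resolved with no loop running
    runner = asyncio.new_event_loop()  # plays the role of the reactor's loop
    try:
        late = runner.run_until_complete(resolve_inside_loop())
        # Whether each resolved loop is the loop that actually runs the code.
        return early is runner, late is runner
    finally:
        runner.close()
        early.close()
```

Here compare_loops() returns (False, True): the loop resolved early is a separate object, while the one resolved inside the running coroutine is the loop doing the work.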
from scrapy.
But why doesn't Scrapy use the current event loop when one already exists, to avoid this error?
from scrapy.
Scrapy currently uses asyncio.get_event_loop() to get the loop.
from scrapy.