
Comments (11)

Gallaecio commented on July 18, 2024

parse was just a suggestion for a quick test. The right approach would probably be to use the engine_started signal.

from scrapy.

wRAR commented on July 18, 2024

Please provide a complete minimal reproducible example.


keatonLiu commented on July 18, 2024

Here is a reproducible example:

from abc import ABCMeta
from typing import Optional

import scrapy
from motor.core import AgnosticCollection, AgnosticDatabase
from motor.motor_asyncio import (
    AsyncIOMotorClient as MotorClient,
)
from pymongo.errors import OperationFailure
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings


class QccSpiderBase(scrapy.Spider, metaclass=ABCMeta):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.settings: Settings = get_project_settings()

        # The Motor client is created when the spider is instantiated.
        # collection_name is not defined in this excerpt.
        self.client: MotorClient = MotorClient(self.settings['MONGO_URI'])
        self.db_name = self.settings['DB_NAME']
        self.db: Optional[AgnosticDatabase] = self.client[self.db_name]
        self.collection: Optional[AgnosticCollection] = self.db[self.collection_name]

    async def parse(self, response, **kwargs):
        # compName is not defined in this excerpt.
        async with await self.client.start_session() as s:
            count = await self.collection.count_documents({"compName": compName}, limit=1, session=s)


Gallaecio commented on July 18, 2024

Is async with await correct? Shouldn’t it be async with?
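
For reference, both keywords are needed when the method is itself a coroutine that returns an async context manager, which is how the reproduction uses start_session(). The following is a stdlib-only sketch with stand-in classes; FakeClient and FakeSession are hypothetical and are not Motor's API.

```python
import asyncio


class FakeSession:
    """Stand-in async context manager (hypothetical, not Motor's session)."""

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False  # do not suppress exceptions


class FakeClient:
    async def start_session(self):
        # A coroutine function: calling it returns an awaitable,
        # not the session itself.
        return FakeSession()


async def with_await(client):
    # await resolves the coroutine to a session; async with then enters it.
    async with await client.start_session() as session:
        return type(session).__name__


async def without_await(client):
    coro = client.start_session()
    try:
        # A bare coroutine object is not an async context manager.
        async with coro:
            return "entered"
    except (TypeError, AttributeError) as exc:
        return type(exc).__name__
    finally:
        coro.close()  # avoid a "never awaited" warning
```

Running with_await(FakeClient()) under asyncio.run enters the session, while without_await(FakeClient()) fails before the body runs; the exact exception type depends on the Python version.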


Gallaecio commented on July 18, 2024

Regarding the actual issue, does it work if you move the client creation to the parse method?


keatonLiu commented on July 18, 2024

It did not raise an error before I updated my libraries, including Scrapy and Motor, to the latest versions. I don't know whether this is new behavior in Twisted.
It looks like the error happens because the Motor client created a new asyncio loop instead of using the loop used by the crawler's reactor. If I don't specify the io_loop parameter when initializing the client, it creates a new loop the first time it gets the loop.
I fixed it by moving the Motor client initialization into parse() and setting its event loop to the current event loop used by Scrapy's Twisted reactor.

    # The excerpt omits its imports: it needs sys and a get_event_loop function.
    def setup_mongo_client(self):
        # Bind the client explicitly to the current event loop.
        self.client = MotorClient(self.settings['MONGO_URI'], io_loop=get_event_loop())
        self.db = self.client[self.db_name]
        self.collection = self.db[self.collection_name]

        # Debug output: all three loops should be the same object.
        print(f"self.client.io_loop: {self.client.io_loop}")
        print(f"get_event_loop(): {get_event_loop()}")
        reactor = sys.modules["twisted.internet.reactor"]
        print(f"twisted.internet.reactor: {reactor._asyncioEventloop}")

    def parse(self, response, **kwargs):
        self.setup_mongo_client()
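
To illustrate the kind of mismatch described above without Scrapy or Motor, here is a stdlib-only sketch. A plain asyncio Future stands in for an object that captured a loop at construction time. This is an assumption for illustration and not Motor's internals, and the error reported in this issue may differ.

```python
import asyncio


def make_future_on(loop):
    # Stand-in for a client that binds to a loop when it is constructed.
    return loop.create_future()


async def wait_for(fut):
    return await fut


def demo():
    early_loop = asyncio.new_event_loop()  # loop captured "in __init__"
    run_loop = asyncio.new_event_loop()    # loop that actually runs later
    fut = make_future_on(early_loop)
    try:
        # Awaiting a future that belongs to another loop fails.
        run_loop.run_until_complete(wait_for(fut))
    except RuntimeError as exc:
        return str(exc)
    finally:
        early_loop.close()
        run_loop.close()
    return None
```

Calling demo() returns the RuntimeError message, which mentions that the future is attached to a different loop.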


keatonLiu commented on July 18, 2024

> Regarding the actual issue, does it work if you move the client creation to the parse method?

Yes, that is the solution, but I used to initialize it in __init__ and it always worked fine.


wRAR commented on July 18, 2024

> Motor client created a new asyncio loop instead of using the loop used by the crawler's reactor. If I don't specify the io_loop parameter when initializing the client, it creates a new loop the first time it gets the loop.

Makes sense.


keatonLiu commented on July 18, 2024

Let me correct that: by default, the Motor client uses motor.frameworks.asyncio.get_event_loop() to get the currently running loop, or to create a new loop when none exists. If I put the initialization in __init__, the reactor hasn't created a loop yet, so Motor creates a new one.
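
The get-or-create behavior described here can be sketched with the standard library alone. The helper get_or_create_loop below is hypothetical; it mimics the description above and is not Motor's actual implementation.

```python
import asyncio


def get_or_create_loop():
    """Hypothetical helper: return the running loop, else create a new one."""
    try:
        return asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running yet, e.g. during object construction.
        return asyncio.new_event_loop()


def demo():
    # Called before any loop runs: a brand-new loop is created.
    early = get_or_create_loop()

    async def inside():
        # Called from a coroutine: the running loop is returned.
        return get_or_create_loop()

    runner = asyncio.new_event_loop()
    try:
        late = runner.run_until_complete(inside())
        return early is runner, late is runner
    finally:
        early.close()
        runner.close()
```

Here demo() returns (False, True): the loop obtained early is not the one that later runs the code, while the loop obtained from inside a coroutine is.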



keatonLiu commented on July 18, 2024

But why doesn't Scrapy use the current event loop when one already exists, to avoid this error?


wRAR commented on July 18, 2024

Scrapy currently uses asyncio.get_event_loop() to get the loop.
