Comments (7)
@untoldbyte what is the use case for using puppeteer to render the html? do you require page interaction?
from crawly.
yes
is there any specific reason why chrome/puppeteer instead of selenium?
@Ziinc we also did not use Selenium for real scraping back when I was working with scrapy. My view of it:
- It was not stable enough (e.g. occasional sudden hangs)
- It did not allow modifying request headers: SeleniumHQ/selenium-google-code-issue-archive#141

So my understanding is that Selenium is not suitable for web crawling, or at least it was not the last time I looked at it.
I would try to build something similar to what is described here: https://blog.scrapinghub.com/how-to-use-a-proxy-in-puppeteer. It should be similar to Splash. (I was about to start on the task, but this summer was a bit chaotic for me :()
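A Splash-like setup along these lines could plug into Crawly as a fetcher that delegates rendering to a small headless-Chrome HTTP service. This is purely a hypothetical sketch: `Crawly.Fetchers.BrowserFetcher` and the `:render_url` option do not exist in Crawly; only the `fetch/2` callback shape is borrowed from the existing `HTTPoisonFetcher` convention.

```elixir
# Hypothetical sketch only -- module name and :render_url option are invented.
defmodule Crawly.Fetchers.BrowserFetcher do
  @behaviour Crawly.Fetchers.Fetcher

  # Delegate rendering to a local puppeteer-backed HTTP wrapper (used the way
  # Splash is), then hand the rendered HTML back to Crawly as a normal
  # HTTPoison.Response so the rest of the pipeline is unchanged.
  def fetch(request, client_options) do
    render_url = Keyword.get(client_options, :render_url, "http://localhost:3000/render")

    case HTTPoison.post(render_url, Jason.encode!(%{url: request.url}), [
           {"content-type", "application/json"}
         ]) do
      {:ok, %HTTPoison.Response{status_code: 200, body: html}} ->
        {:ok, %HTTPoison.Response{status_code: 200, body: html, request_url: request.url}}

      {:ok, %HTTPoison.Response{} = resp} ->
        {:error, {:render_failed, resp.status_code}}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

The point of keeping the fetcher interface identical is that spiders would not need to know whether a page came from HTTPoison or from a rendering browser.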
There is no doubt that Puppeteer would be great for browser automation, and there isn't much of a browser-monoculture issue (since Firefox support is being added), so I am not particularly against adding Puppeteer as a Fetcher module. My concern is how we would reconcile page interaction with HTML parsing.
The key pain point for this issue is how to obtain further HTML from page interactions.
For example, suppose the data we want is only rendered in modals that are toggled open with buttons. Using Puppeteer, we can open each modal and scrape the data in a single request with a single Node.js script. However, doesn't that render the spider's parse_item callback redundant? All scraping logic would live in the Node.js script itself, which prevents us from using Elixir's libraries and ecosystem, reusing logic, etc. And we don't currently have Elixir bindings for the Puppeteer API, which prevents us from interacting with the browser from the parse_item callback directly.
A few ideas I have for this problem:
1. The Node.js scripts scrape HTML fragments, which are attached to the Response object and then passed to the parse_item callback for parsing.
   - This scrapes the page twice, which is somewhat inefficient.
2. The browser window is kept open while parse_item is called, and we interact with the page through an exec_script/1 function that executes either a JS fragment or a script file. This function returns a new request with the updated body, which we can continue parsing in Elixir.
   - This ensures that most parsing is done in Elixir and Node.js is only used for interacting with the browser, which I think is much more ideal and prevents over-use of JS.
3. Capture the async API requests/responses made by the JS site and parse the responses directly, using idea 2 to interact with the page.
   - This would allow scraping the API responses directly for cleaner data.

I think idea 2 is the better option, and idea 3 could be a nice-to-have.
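From the spider's side, idea 2 might look something like the sketch below. To be clear, none of this exists in Crawly: exec_script/1 is the function proposed in the list above, and the `Crawly.Browser` module name is invented here for illustration; only Floki is a real library.

```elixir
# Hypothetical sketch of idea 2 -- Crawly.Browser.exec_script/1 is invented.
defmodule ModalSpider do
  use Crawly.Spider

  # init/0, base_url/0, etc. omitted for brevity.

  def parse_item(response) do
    # Click the modal open in the still-live browser session; the call
    # returns a new response whose body is the updated DOM.
    {:ok, updated_response} =
      Crawly.Browser.exec_script("document.querySelector('#open-modal').click()")

    # All actual parsing stays in Elixir, e.g. with Floki.
    items =
      updated_response.body
      |> Floki.parse_document!()
      |> Floki.find(".modal .price")
      |> Enum.map(&%{price: Floki.text(&1)})

    %Crawly.ParsedItem{items: items, requests: []}
  end
end
```

This keeps the parse_item contract intact: JS is only a remote control for the browser, and the extracted HTML always flows back through the normal Elixir parsing path.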
@Ziinc maybe you're right. I don't have the full picture yet. I would try to build a prototype and see what is required by production usage.
From my scraping experience, scraping from modal windows was an extremely rare [I would even say almost never needed] use case... In the vast majority of cases, a simple request to something like Splash (which is also scriptable and allows executing JS on the client side) was enough.
For now I would consider headless Chrome as a way to overcome bans. Currently our Amazon spiders are blocked after 2000-3000 requests, so I am looking for a standard way to do better.
We could consider Microsoft's Playwright, which is cross-browser. It might make rotating the user agent easier.
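For plain header-based rotation, Crawly's built-in UserAgent middleware already covers part of this; what Playwright would add is rotating the actual browser fingerprint, not just the header. A config sketch of the middleware approach (the exact option shape may differ between Crawly versions, so treat this as indicative and check the docs for your release):

```elixir
# config/config.exs -- rotate the User-Agent header per request using
# Crawly's built-in middleware. UA strings here are illustrative examples.
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent,
     user_agents: [
       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
     ]}
  ]
```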
Related Issues (20)
- Crawly.fetch giving 301 response instead of 200
- My Spider's code is never invoked, weird behavior with `Crawly.RequestsStorage.pop` in library code
- This is actually a question: nested scraping
- GenServer timeout crash in long-running pipeline
- Stop and resume the spider where it stopped
- Protocol error
- `Crawly.Fetchers.Fetcher` implementation for Playwright
- robots.txt matching is pretty buggy
- Running many instances of one spider
- Make the management tool opt-in by default
- Q: Can the spider "fan out" on a website? (multiple next items)
- Error: Could not load spiders.
- [error] Pipeline crash by call: Crawly.Middlewares.UniqueRequest.run
- Encountering complications with forwarding to Crawly.API.Router
- Upgrade to `httpoison` 2.x?
- Management web UI on localhost:4001 is not working
- Does Crawly support requests using the POST method?
- Crawly compilation warnings, undefined Floki functions
- CI issue: Failed to upload the report to 'https://coveralls.io', couldn't find a repository matching this job.
- Set base_url in init options instead of callback