Comments (13)
@foundAhandle I don't know how much experience you have with scraping, but I've only had to use a headless browser once (and I've scraped and crawled a lot of different things). In that case, a weird, complicated encryption (anti-scraping) system was in place and I decided it was easier to just run a PhantomJS setup on my local machine for a one-off scrape.
But in pretty much any other case, it was always easier, more straightforward, and 100x faster and cheaper to just find the server APIs and use them. When you click a button to receive data, a request is made to a server with a certain interface. You just need to understand how those endpoints work and use them directly. Believe me, it's way cleaner and cheaper, and will save you some headaches. And money.
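For example, here's a minimal sketch of replicating the request a page's search button fires, using only Node's built-in URL API. The endpoint and parameter names are hypothetical — you'd find the real ones in the devtools Network tab:

```javascript
// Sketch: rebuild the request that the page's "Search" button makes,
// instead of driving a headless browser. Endpoint and parameter names
// below are hypothetical placeholders.
const endpoint = new URL('https://example.com/api/search');

endpoint.searchParams.set('query', 'attorneys');
endpoint.searchParams.set('page', '1');

// With Node 18+ you could now fetch it directly:
//   const response = await fetch(endpoint);
//   const results = await response.json();
console.log(endpoint.toString());
// → https://example.com/api/search?query=attorneys&page=1
```

No page load, no JavaScript execution, no browser — just the one request the button would have made.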
> clicking buttons, working through search engines - in order to get to the data
You don't need to actually load the page to get the data :)
from surgeon.
@foundAhandle there are a few specific cases where headless browsers might make sense, though, like taking screenshots. But those are marginal and can be implemented separately as helpers (scrape everything with cheerio / surgeon / request, and have a function load a specific URL in headless Chrome to take the screenshot).
Believe me, that makes much more sense when scraping. Headless browser scraping is unscalable and becomes a big headache as your project grows in volume and complexity.
from surgeon.
Sure you can.
Can you share an example of a query?
You are probably simply missing quotes.
from surgeon.
I found out about Surgeon through your comment at the bottom of "A Guide to Automating & Scraping the Web with JavaScript (Chrome + Puppeteer + Node JS)".
I disagree with your statement that it is a "really silly idea to use Puppeteer to “scrape the web”". Scraping sites is more than just extracting data; it's also navigating those sites - clicking buttons, working through search engines - in order to get to the data. That's where Puppeteer/Chrome or CasperJS/PhantomJS come into play. I've used both combinations to make web crawlers and they make things many times easier.
Also, your assertion that the example web site is a SPA is not correct. Clicking on a book link navigates the browser to a different page each time, so it is not a single-page application.
In spite of this, I'm trying to use Surgeon with Puppeteer and that's where I ran into the Issue that I filed. Here's a gist.
from surgeon.
> In spite of this, I'm trying to use Surgeon with Puppeteer and that's where I ran into the Issue that I filed. Here's a gist.
As I said, you are missing quotes around your selector, i.e. it should be `sm '[sitetranslationname="$barstate_80"]' | rtc`.
from surgeon.
@gajus I didn't include an outer set of quotes per the note in the docs for Built-in subroutine aliases where it states: "Note regarding s ... alias. The CSS selector value is quoted."
@DaniGuardiola I have to scrape https://lw.com from its home page, through its search engine, and extract data (name, email, phone, practice, description, etc) from all of the attorney bios. Based on your comments, how would you go about doing that?
from surgeon.
> @gajus I didn't include an outer set of quotes per the note in the docs for Built-in subroutine aliases where it states: "Note regarding s ... alias. The CSS selector value is quoted."
It should be augmented to say: ... unless the expression itself includes quotes.
i.e. `sm .foo` does not require quotes; `sm '[foo="bar"]'` does.
from surgeon.
@foundAhandle ok let me help you with this :)
First, load the page, open the devtools (I'm assuming Chrome) and go to the Network tab. Filter to only show 'XHR' requests, as that's the kind of request most applications make. It usually helps to click the 'clear' button to make things easier. Now you're ready to inspect the requests.
Proceed with the action you want to inspect - in this case, using the search engine. You will see a request appear in the devtools. Click it to see the details: the URL and the query parameters being used. You just need to understand how those parameters work (use the UI to select your desired parameters and trigger the search with the UI button).
Then you parse the response, which will probably be JSON (easily parsed with JSON.parse) or HTML (use @gajus's tool, Surgeon, for that).
That will probably give you a list containing basic data and each item's URL, which you can page through by changing the parameters. With that list, you can then proceed to scrape those pages for the complete details.
This would be a very good approach. I recommend MongoDB and the 'request' npm module; they will make your life way easier. And you'll be surprised at how fast your scraper runs compared with the headless browser solution.
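As a rough sketch of that flow, with a mocked JSON response — the payload shape and field names here are hypothetical, so inspect the real XHR response in devtools to learn the actual shape:

```javascript
// Sketch of the workflow above, with a mocked server response.
// The payload shape ("results", "name", "url") is an assumption —
// the real API's response will look different.
const rawResponse = JSON.stringify({
  results: [
    { name: 'Jane Doe', url: '/attorneys/jane-doe' },
    { name: 'John Roe', url: '/attorneys/john-roe' },
  ],
});

// Parse the response body as you would after a real request.
const data = JSON.parse(rawResponse);

// Build the list of detail pages to scrape next, resolving
// relative URLs against the site's origin.
const detailUrls = data.results.map(
  (item) => new URL(item.url, 'https://example.com').toString(),
);

console.log(detailUrls);
// → ['https://example.com/attorneys/jane-doe', 'https://example.com/attorneys/john-roe']
```

From there, you'd feed each detail URL back through the same request-and-parse loop and store the extracted records in MongoDB.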
Let me know if you have any questions / need help with anything :)
from surgeon.
@DaniGuardiola I'm familiar with parsing requests and looking at GET and POST name/value pairs. What exactly is your workflow for getting the XHR request from Chrome to the npm request module and/or Mongo? Are you using HAR files at all? How are you handling cookies, sessions, etc.?
from surgeon.
@DaniGuardiola You still there? So I've been checking out the request module and I ran into a problem: how do I execute client-side code? The two pages I've tested it on both have DOM-altering code that injects the elements I need. What's the solution?
The links:
https://www.skadden.com/professionals?skip=1000&letter=A
https://www.gtlaw.com/en/professionals?pageNum=100&letter=A
from surgeon.
@foundAhandle sorry, busy days. I see you sent me an email. I'll give you a few contact options and I will try to reserve 15 minutes soon to call you and assist you if you want. It will be faster :)
from surgeon.
OK. Sounds good.
from surgeon.
Hey Dani. I got your email, unfortunately, my emails to you are being blocked. I'd like to talk, I sent you an email from [email protected] with my phone number. Please call me or respond to that email. Thanks.
from surgeon.