
Comments (13)

DaniGuardiola avatar DaniGuardiola commented on June 8, 2024 1

@foundAhandle I don't know how much experience you have with scraping, but I've only had to use a headless browser once (and I've scraped and crawled a lot of different things). In that case, a weird, complicated encryption (anti-scraping) system was in place, and I decided it was easier to just run a PhantomJS setup on my local machine for a one-off scrape.

But in pretty much any other case, it was always easier, more straightforward, and 100x faster and cheaper to just find out the server APIs and use them. When you click a button to receive data, a request is being made to a server with a certain interface. You just need to understand how those endpoints work and use them. Believe me, it's way cleaner and cheaper, and will save you some headaches. And money.

clicking buttons, working through search engines - in order to get to the data

You don't need to actually load the page to get the data :)
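The idea above (hit the endpoint the button click triggers, and parse its response directly) can be sketched in a few lines of Node. The endpoint URL, parameter names, and response shape here are all hypothetical; the real ones come from watching the Network tab:

```javascript
// Build the same request the UI button would fire. Endpoint and parameters
// are made up for illustration; read the real ones from the devtools Network tab.
const endpoint = new URL('https://example.com/api/search');
endpoint.searchParams.set('query', 'attorneys');
endpoint.searchParams.set('page', '1');

// A JSON body like the one the XHR would return (shape assumed):
const body = JSON.stringify({
  results: [
    { name: 'Jane Doe', profileUrl: '/people/jane-doe' },
    { name: 'John Roe', profileUrl: '/people/john-roe' },
  ],
});

// No page load, no DOM: just parse the JSON the page would have rendered.
const names = JSON.parse(body).results.map((result) => result.name);
```

Fetching `endpoint` with any HTTP client (the `request` module, `https.get`, ...) and parsing the body like this replaces the whole click-and-wait cycle.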

from surgeon.

DaniGuardiola avatar DaniGuardiola commented on June 8, 2024 1

@foundAhandle there are a few specific cases where headless browsers might make sense, though, like taking screenshots. But those are very marginal and can be implemented separately as helpers (scrape everything with cheerio / surgeon / request, and have a function load a specific URL in headless Chrome to take a screenshot).

Believe me, that makes much more sense when scraping. Headless browser scraping is unscalable and a big headache once your project starts growing in volume and complexity.
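One way to structure that split, sketched with stubs (in practice the data path would use request + cheerio/surgeon, and the screenshot helper would drive headless Chrome; both workers below are placeholders):

```javascript
// Hybrid setup: plain HTTP scraping as the default path, with a headless
// browser kept behind an isolated helper for the rare screenshot task.
function createScraper({ fetchHtml, takeScreenshot }) {
  return async function run(task) {
    if (task.type === 'screenshot') {
      // Marginal case: hand off to the headless-browser helper.
      return takeScreenshot(task.url);
    }
    // Common case: fetch and parse, no full page load.
    return fetchHtml(task.url);
  };
}

// Stubs standing in for the real implementations:
const run = createScraper({
  fetchHtml: async (url) => `html:${url}`,       // would be request + cheerio
  takeScreenshot: async (url) => `png:${url}`,   // would be headless Chrome
});
```

The point of the indirection is that the expensive browser dependency stays quarantined in one helper instead of leaking into every scraping job.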

from surgeon.

gajus avatar gajus commented on June 8, 2024

Sure you can.

Can you share an example of a query?

You are probably simply missing quotes.

from surgeon.

foundAhandle avatar foundAhandle commented on June 8, 2024

I found out about Surgeon through your comment at the bottom of A Guide to Automating & Scraping the Web with JavaScript (Chrome + Puppeteer + Node JS).

I disagree with your statement that it is a "really silly idea to use Puppeteer to 'scrape the web'". Scraping sites is more than just extracting data; it's also navigating those sites - clicking buttons, working through search engines - in order to get to the data. That's where Puppeteer/Chrome or CasperJS/PhantomJS come into play. I've used both combinations to make web crawlers, and they make things many times easier.

Also, your assertion that the example web site is a SPA is not correct. Clicking on a book link navigates the browser to a different page each time, so it is not a single-page application.

In spite of this, I'm trying to use Surgeon with Puppeteer and that's where I ran into the Issue that I filed. Here's a gist.

from surgeon.

gajus avatar gajus commented on June 8, 2024

In spite of this, I'm trying to use Surgeon with Puppeteer and that's where I ran into the Issue that I filed. Here's a gist.

As I said, you are missing quotes around your selector, i.e. it should be sm '[sitetranslationname="$barstate_80"]' | rtc.

from surgeon.

foundAhandle avatar foundAhandle commented on June 8, 2024

@gajus I didn't include an outer set of quotes per the note in the docs for Built-in subroutine aliases where it states: "Note regarding s ... alias. The CSS selector value is quoted."

@DaniGuardiola I have to scrape https://lw.com from its home page, through its search engine, and extract data (name, email, phone, practice, description, etc) from all of the attorney bios. Based on your comments, how would you go about doing that?

from surgeon.

gajus avatar gajus commented on June 8, 2024

@gajus I didn't include an outer set of quotes per the note in the docs for Built-in subroutine aliases where it states: "Note regarding s ... alias. The CSS selector value is quoted."

It should be augmented to say: ... unless the expression itself includes quotes.

i.e. sm .foo does not require quotes; sm '[foo="bar"]' needs quotes.
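A toy tokenizer (not Surgeon's actual parser, just an illustration of the rule) shows why: an expression is split on whitespace unless the selector is quoted, so a selector containing spaces or embedded quotes gets mangled without the outer quotes:

```javascript
// Toy tokenizer, NOT Surgeon's real parser: quoted chunks survive as a
// single token, unquoted chunks split on whitespace.
function tokenize(expression) {
  const tokens = [];
  const pattern = /'([^']*)'|"([^"]*)"|(\S+)/g;
  let match;
  while ((match = pattern.exec(expression)) !== null) {
    tokens.push(match[1] ?? match[2] ?? match[3]);
  }
  return tokens;
}

// Quoted: the attribute selector stays intact as one token.
const quoted = tokenize("sm '[foo=\"bar baz\"]'");
// Unquoted: the same selector is torn apart at the space.
const unquoted = tokenize('sm [foo="bar baz"]');
```

A bare class selector like `.foo` has nothing to tear apart, which is why `sm .foo` works either way.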

from surgeon.

DaniGuardiola avatar DaniGuardiola commented on June 8, 2024

@foundAhandle ok let me help you with this :)

First load the page, open the devtools (I'm assuming Chrome) and go to the Network tab. Filter to only show 'XHR' requests, as that's the kind of request most applications make. It usually helps to click the 'clear' button to make things easier. Now you're ready to inspect the requests.

Proceed with the action you'll be inspecting, in this case using the search engine. You will see a request appear in the devtools. Click it to get the details: the URL and the query parameters being used. You just need to understand how these parameters work (use the UI to select your desired parameters and trigger the search with the UI button).

Then you parse the response, which will probably be JSON (easily parsed with JSON.parse) or HTML (use @gajus's tool for that).

That will probably give you a list you can iterate over by changing the parameters, containing basic data and each item's URL. Then, with that list, you can proceed to scrape those pages for complete details.

This would be a very good approach. I recommend MongoDB and the 'request' npm module; they will make your life way easier. And you'll be surprised at how fast your scraper runs compared with the headless browser solution.
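The iterate-by-parameters loop described above might look like this; `fetchPage` is a stand-in for whatever HTTP client performs the real search request (stubbed here so the flow is visible without a network):

```javascript
// Walk a paginated search endpoint, collecting item URLs until a page comes
// back empty. The response shape ({ results: [{ profileUrl }] }) is assumed.
async function collectProfileUrls(fetchPage) {
  const urls = [];
  for (let page = 1; ; page += 1) {
    const { results } = await fetchPage(page);
    if (results.length === 0) {
      break; // ran out of pages
    }
    for (const item of results) {
      urls.push(item.profileUrl);
    }
  }
  return urls;
}

// Stub: two pages of one result each, then an empty page.
const fakeFetchPage = async (page) =>
  page <= 2 ? { results: [{ profileUrl: `/people/p${page}` }] } : { results: [] };
```

Each collected URL then gets its own request, and the extracted details go into MongoDB (or wherever) one record at a time.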

Let me know if you have any questions / need help with anything :)

from surgeon.

foundAhandle avatar foundAhandle commented on June 8, 2024

@DaniGuardiola I'm familiar with parsing requests and looking at GET and POST name/value pairs. What exactly is your workflow for getting the XHR request from Chrome into the npm request module and/or Mongo? Are you using HAR files at all? How are you handling cookies, sessions, etc.?

from surgeon.

foundAhandle avatar foundAhandle commented on June 8, 2024

@DaniGuardiola You still there? So I've been checking out the request module and I ran into a problem: how do I execute client-side code? The two pages I've tested it on both have DOM-altering code that injects the elements I need. What's the solution?

The links:
https://www.skadden.com/professionals?skip=1000&letter=A
https://www.gtlaw.com/en/professionals?pageNum=100&letter=A

from surgeon.

DaniGuardiola avatar DaniGuardiola commented on June 8, 2024

@foundAhandle sorry, busy days. I see you sent me an email. I'll give you a few contact options and I will try to reserve 15 minutes soon to call you and assist you if you want. It will be faster :)

from surgeon.

foundAhandle avatar foundAhandle commented on June 8, 2024

OK. Sounds good.

from surgeon.

foundAhandle avatar foundAhandle commented on June 8, 2024

Hey Dani. I got your email, unfortunately, my emails to you are being blocked. I'd like to talk, I sent you an email from [email protected] with my phone number. Please call me or respond to that email. Thanks.

from surgeon.
