Product Information? (metascraper, closed)

microlinkhq commented on August 31, 2024
Product Information?

from metascraper.

Comments (6)

blakeembrey commented on August 31, 2024

@oyeanuj I already have https://github.com/blakeembrey/node-scrappy, which parses out product information (JSON-LD and microdata), if you're interested. It just needs to be extracted from the resulting data set (Scrappy uses a two-phase scraping process: first it scrapes all information, then it creates snippets). Here's an example of product information from Airbnb (https://github.com/blakeembrey/node-scrappy/blob/master/test/fixtures/airbnb-ny-apartment/result.json#L62-L75).
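To make the two-phase idea concrete, here is a minimal sketch, not Scrappy's actual code or API. It assumes the page embeds a schema.org `Product` as JSON-LD, and every name in it (`extractJsonLd`, `toProductSnippet`, `ProductSnippet`, the sample HTML) is hypothetical. Phase one collects all JSON-LD objects without interpreting them; phase two shrinks that raw data into a small snippet.

```typescript
// Hypothetical shapes for illustration only; not Scrappy's real types.
type JsonLdNode = Record<string, unknown>;

interface ProductSnippet {
  name: string | undefined;
  price: string | undefined;
  currency: string | undefined;
}

// Phase 1: collect every JSON-LD object embedded in the page, uninterpreted.
// A regex is used to keep the sketch dependency-free; a real scraper would
// more likely use an HTML parser.
function extractJsonLd(html: string): JsonLdNode[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const nodes: JsonLdNode[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      const parsed: unknown = JSON.parse(match[1] ?? "");
      const items: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
      for (const item of items) {
        if (typeof item === "object" && item !== null) {
          nodes.push(item as JsonLdNode);
        }
      }
    } catch {
      // Skip malformed JSON-LD blocks instead of failing the whole scrape.
    }
  }
  return nodes;
}

// Coerce a JSON value to a string if it is a string or a number.
function asString(value: unknown): string | undefined {
  if (typeof value === "string") return value;
  if (typeof value === "number") return String(value);
  return undefined;
}

// Phase 2: reduce the raw data set to a normalized product snippet.
function toProductSnippet(nodes: JsonLdNode[]): ProductSnippet | undefined {
  const product = nodes.find((node) => node["@type"] === "Product");
  if (product === undefined) return undefined;
  const offers = product["offers"];
  // schema.org allows a single offer or a list; take the first one.
  const offer = (Array.isArray(offers) ? offers[0] : offers) as
    | JsonLdNode
    | undefined;
  return {
    name: asString(product["name"]),
    price: asString(offer?.["price"]),
    currency: asString(offer?.["priceCurrency"]),
  };
}

// Made-up sample page for the usage example.
const html = `<html><head><script type="application/ld+json">
{"@type":"Product","name":"Desk Lamp","offers":{"price":"19.99","priceCurrency":"USD"}}
</script></head></html>`;

const snippet = toProductSnippet(extractJsonLd(html));
console.log(snippet);
```

Keeping the phases separate means the raw data set can be stored or inspected on its own, and different snippet builders can be tried against it without re-scraping the page.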

@ianstormtaylor Sorry to cross-promote; we had this discussion a while back, I think. My goal is to extract known information from the page, while this project's goal was slightly different. I'd still be down to try to normalize them if possible.

Edit: Note that my goal is also to use only standardised metadata for now; it's not scraping unknowns.

Edit 2: It's also parsing favicons, so you may want to replicate that logic here - https://github.com/blakeembrey/node-scrappy/blob/master/src/rules/html.ts#L415-L421 and https://github.com/blakeembrey/node-scrappy/blob/master/src/rules/html.ts#L533-L556.

ianstormtaylor commented on August 31, 2024

Hey @oyeanuj, nice, I agree!

Related to #11, I think it would be nice to have these different types of rules bundled as separate plugins, since they're very specific. And it doesn't really make sense for articles to be given so much weight over other types of content by being part of core. I just did it that way since it was my first needed use case.

If you end up hacking on a product bundle of scraping rules, I'd be down to split them out!
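As a rough illustration of what "rules bundled as separate plugins" could look like, here is a hypothetical sketch; it is not metascraper's actual API. The `Rule` and `RuleBundle` types, `metaProperty`, `scrape`, and the two bundles are all made up for the example, and it assumes the page exposes Open Graph style `<meta property="..." content="...">` tags with `property` written before `content`.

```typescript
// Hypothetical types for illustration; not metascraper's real API.
type Rule = (html: string) => string | undefined;
type RuleBundle = Record<string, Rule[]>;

// Build a rule that reads the content attribute of a <meta property="..."> tag.
// Simplification: only matches when `property` appears before `content`.
function metaProperty(property: string): Rule {
  const escaped = property.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    `<meta[^>]*property=["']${escaped}["'][^>]*content=["']([^"']*)["']`,
    "i"
  );
  return (html) => pattern.exec(html)?.[1];
}

// Each content type lives in its own bundle, so none is privileged by core.
const articleRules: RuleBundle = {
  title: [metaProperty("og:title")],
};

const productRules: RuleBundle = {
  price: [metaProperty("product:price:amount")],
  currency: [metaProperty("product:price:currency")],
};

// Core only knows how to run bundles: for each field, the first rule that
// yields a non-empty value wins, and earlier bundles take precedence.
function scrape(html: string, bundles: RuleBundle[]): Record<string, string> {
  const result: Record<string, string> = {};
  for (const bundle of bundles) {
    for (const [field, rules] of Object.entries(bundle)) {
      if (field in result) continue;
      for (const rule of rules) {
        const value = rule(html);
        if (value !== undefined && value !== "") {
          result[field] = value;
          break;
        }
      }
    }
  }
  return result;
}

// Made-up sample markup for the usage example.
const page =
  `<meta property="og:title" content="Desk Lamp">` +
  `<meta property="product:price:amount" content="19.99">`;

console.log(scrape(page, [articleRules, productRules]));
```

With this shape, a caller who only cares about products passes just `productRules`, and a new content type can be added without touching `scrape`.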

ianstormtaylor commented on August 31, 2024

Nice! No worries about cross-promotion at all :)

blakeembrey commented on August 31, 2024

Thanks 😄

FWIW, all the major product pages in the linked Ruby app seem to have decent metadata already on the page. I ran them through the current version of Scrappy and it extracted product information from all of them (borderless/unfurl@612dff2); they all use microdata. Someone just needs to use that microdata.

Edit: See result.json; that's the raw extracted data before it's shrunk into a normalized snippet.

oyeanuj commented on August 31, 2024

@blakeembrey Very cool! I'll try to go over the commit and play around with node-scrappy soon!

Kikobeats commented on August 31, 2024

Please check #41 😄
