Make the API declarative (surgeon, closed, 26 comments)

gajus commented on June 8, 2024
Make the API declarative

from surgeon.

Comments (26)

DaniGuardiola commented on June 8, 2024

I just discovered this tool and also this alternative API and I'm quite hyped. If this becomes stable I'll be talking to my CTO to refactor our scrapers. I'm preparing the proposal already!

Keep it up 👍

gajus commented on June 8, 2024

DaniGuardiola commented 2 minutes ago

I just discovered this tool and also this alternative API and I'm quite hyped. If this becomes stable I'll be talking to my CTO to refactor our scrapers. I'm preparing the proposal already!

Keep it up 👍

Don't rush into it, @DaniGuardiola. This API is still going to change. (I appreciate the enthusiasm, though.)

I am going to post an update in the next 15 minutes describing what's wrong with the above and how it can be improved.

gajus commented on June 8, 2024

Also, thinking about performance, did you think about "stacking" cheerio
(or whatever) calls somehow? Like, for example:

That's already being done.

gajus commented on June 8, 2024

To continue the example from the documentation:

x('body', {
  title: x('title'),
  articles: x('article {0,}', {
    body: x('.body@innerHTML'),
    summary: x('.body p {0,}[0]'),
    imageUrl: x('img@src'),
    title: x('.title')
  })
});

A declarative approach would look something like this:

{
  "selector": "body",
  "properties": {
    "title": "title",
    "articles": {
      "selector": "article {0,}",
      "properties": {
        "body": ".body@innerHTML",
        "summary": ".body p {0,}[0]",
        "imageUrl": "img@src",
        "title": ".title"
      }
    }
  }
}

The equivalent in YAML is even shorter:

selector: body
properties:
  title: title
  articles:
    selector: article {0,}
    properties:
      body: .body@innerHTML
      summary: .body p {0,}[0]
      imageUrl: img@src
      title: .title

This also makes the syntax for declaring validators and formatters more intuitive, e.g.

selector: body
properties:
  title: title
  articles:
    selector: article {0,}
    properties:
      body: .body@innerHTML
      summary: .body p {0,}[0]
      imageUrl: img@src
      title:
        selector: .title
        test: /foo/
        format: "upperCase"

Where test (a property of title) is a regular expression used to validate the result, and format (also a property of title) is used to format the result (as requested in #3).
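
As a sketch of these semantics, the test would be applied to the extracted value before a named formatter transforms it. The formatter registry, function names, and error handling below are illustrative assumptions, not Surgeon's implementation:

```javascript
// Hypothetical sketch: validate an extracted value against `test`, then
// apply a named formatter from a user-supplied registry.
const formatters = {
  upperCase: (value) => value.toUpperCase()
};

const applyRules = (value, rules) => {
  if (rules.test && !rules.test.test(value)) {
    throw new Error('Validation failed for value: ' + value);
  }

  return rules.format ? formatters[rules.format](value) : value;
};

const title = applyRules('foo bar', {
  format: 'upperCase',
  test: /foo/
});
// title === 'FOO BAR'
```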

Minus the custom DSL used in the selector, this is becoming a lot like https://github.com/rla/dom-eee. (Might be a good thing.)

We could also add JSON schema support, e.g.

selector: .movie
schema: "movie"
properties:
  name: ".name"
  url: ".url"

Where schema: "movie" refers to a JSON schema loaded at the time of constructing a Surgeon instance.


I was looking for "declarative scraper" and found this: https://github.com/ContentMine/scraperJSON. It's not mature or anything, but it demonstrates an attempt to write a declarative scraper. There is even a Node.js implementation, https://github.com/ContentMine/thresher.

I like the idea of the regex capture groups,

regex - an Object specifying a regular expression whose groups should be captured as the results. The results will be an array of the captured groups. If the global flag (g) is specified, the result will be an array of arrays of captured groups. There are two keys allowed:
- source - a string specifying the regular expression to be executed. Required
- flags - an array specifying the regex flags to be used (g, m, i, etc.). Optional (omitting this key will cause the regex to be executed with no flags).
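
The quoted semantics can be illustrated with a short runnable sketch (the `runRegex` helper name is mine, not scraperJSON's):

```javascript
// Execute a scraperJSON-style regex object against text. Without the `g`
// flag the result is the array of captured groups; with `g` it is an array
// of arrays of captured groups, one per match.
const runRegex = (text, { source, flags = [] }) => {
  const re = new RegExp(source, flags.join(''));

  if (!flags.includes('g')) {
    const match = re.exec(text);

    return match ? match.slice(1) : null;
  }

  return Array.from(text.matchAll(re), (match) => match.slice(1));
};

const single = runRegex('10h30', { source: '(\\d+)h(\\d+)' });
// single → ['10', '30']

const all = runRegex('10h30 21h15', { source: '(\\d+)h(\\d+)', flags: ['g'] });
// all → [['10', '30'], ['21', '15']]
```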

There is also https://github.com/drbig/grabber

gajus commented on June 8, 2024

@rla any thoughts on this?

sllvn commented on June 8, 2024

By limiting the schema to JSON, you do limit the transforms that can be done re: #3. Or would there be a way to define custom transforms and reference those by key (such that I could specify format: "myAwesomeTransform")?

gajus commented on June 8, 2024

By limiting the schema to JSON, you do limit the transforms that can be done re: #3. Or would there be a way to define custom transforms and reference those by key (such that I could specify format: "myAwesomeTransform")?

That's the idea.

I am convinced transforms should be added, though. Extracting data and formatting data are two very different tasks.

rla commented on June 8, 2024

The focus of DOM-EEE was mainly on the extraction part, as the code involved there had a tendency to get too complex. The idea was to get simpler objects from the DOM by cutting down most of the element noise. The simplified object tree is supposed to be easier to work with using basic language constructs, such as loops or Array methods. This is what I experienced in multiple projects. The second goal was extreme portability. This made JSON input-output mandatory. If the project is mainly to be used from JavaScript environments, then this might not be the optimal choice.

Validation and transformation can also be represented in the declarative form as string identifiers or as arrays of them, like

{
  selector: '.date',
  transform: 'convert-date',
  validity: 'date-not-in-future'
}

The actual transforms and validators need to be defined and registered with the library first. Non-existing transforms and validators can then be easily checked for. I see that defining them in-line directly on the declarative form can make it too complex. The order in which to apply validations and transforms is not clear, though. We might want to check whether the selector matches at all or actually validate the transformed date.
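
The registration idea could be sketched as follows; the registry shape and names are illustrative assumptions, not part of any existing library:

```javascript
// Transforms and validators are named up front, so an unknown identifier in
// a declarative schema can be rejected before any scraping happens.
const createRegistry = () => {
  const entries = new Map();

  return {
    register: (name, fn) => entries.set(name, fn),
    resolve: (name) => {
      if (!entries.has(name)) {
        throw new Error('Unknown transform or validator: ' + name);
      }

      return entries.get(name);
    }
  };
};

const registry = createRegistry();

registry.register('convert-date', (value) => new Date(value));

const convert = registry.resolve('convert-date');
// registry.resolve('no-such-transform') would throw before scraping starts.
```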

One of the DOM-EEE design aspects is that non-matching selectors return null, making catch-all validation easy. The output just has to be checked for nulls. This, with some amount of "manually executed" checks, has proven to cover lots of cases for me.

JSON schema can be applied to the output independently of this library. If we built in support, we would have to choose an implementation. As I understand it, there are drafts 3 and 4 of JSON Schema with huge differences, and various packages pick arbitrarily what to support from either of them.

gajus commented on June 8, 2024

JSON schema can be applied to the output independently of this library. If we built in support, we would have to choose an implementation. As I understand it, there are drafts 3 and 4 of JSON Schema with huge differences, and various packages pick arbitrarily what to support from either of them.

In terms of choosing an implementation, Ajv (https://github.com/epoberezkin/ajv) is now a somewhat de facto standard in the JavaScript community.

However, I agree with your point that it can be done outside of Surgeon.

One of the DOM-EEE design aspects is that non-matching selectors return null, making catch-all validation easy. The output just has to be checked for nulls. This, with some amount of "manually executed" checks, has proven to cover lots of cases for me.

What about throwing an error? (like Surgeon does at the present time)

This way you are sure that no unexpected behaviour is left unseen.

The actual transforms and validators need to be defined and registered with the library first. Non-existing transforms and validators can then be easily checked for. I see that defining them in-line directly on the declarative form can make it too complex.

Agree.

The order in which to apply validations and transforms is not clear, though. We might want to check whether the selector matches at all or actually validate the transformed date.

What is a use case for wanting to apply a validator after formatting the message?

Wouldn't a formatter throw an error if it cannot format the data to the desired format?

rla commented on June 8, 2024

@gajus,

What about throwing an error? (like Surgeon does at the present time)

This needs some mechanism to mark optional properties, for the case when an element sometimes exists and sometimes does not, but you would like to use it when it exists.

What is a use case for wanting to apply a validator after formatting the message?

To decouple date parsing from checking whether the date is in the future, for example. Composability; otherwise you need a single transform like parseDateButAlsoCheckItIsInFuture. I can see that some sort of pipeline could be defined, maybe even represented similarly to shell pipes, like apply: 'parseDate|checkDateInFuture', where a validation is just an identity transform that throws an error when the condition does not hold.
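
A sketch of this pipe idea; `parseDate` and `checkDateInFuture` follow the names in the comment above, while the implementations and the `apply` helper are illustrative:

```javascript
// A validator is just an identity transform that throws when its condition
// does not hold; steps are chained by splitting the expression on '|'.
const pipeline = {
  checkDateInFuture: (date) => {
    if (date.getTime() <= Date.now()) {
      throw new Error('Date is not in the future: ' + date.toISOString());
    }

    return date;
  },
  parseDate: (value) => new Date(value)
};

const apply = (value, expression) => {
  return expression
    .split('|')
    .reduce((result, name) => pipeline[name](result), value);
};

const futureDate = apply('2999-01-01', 'parseDate|checkDateInFuture');
// futureDate is a Date; applying the same pipeline to '1999-01-01' throws.
```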

gajus commented on June 8, 2024

Wouldn't you agree that checkDateInFuture is a filter feature rather than validation?

The fact that a document contains a date that is in the past does not make it an invalid date. It is just a data set that you are not interested in. "validate" here would do no good, since it would throw an error (and break the scraper). A filter could be used, though.

rla commented on June 8, 2024

@gajus,

Wouldn't you agree that checkDateInFuture is a filter feature rather than validation?

A better example is parsing a URL and checking whether it contains a specific query parameter. The validation here means guarding against a changed URL structure, not filtering a set of URLs on the page. Filtering probably needs to be described as well, maybe as a separate issue.

gajus commented on June 8, 2024

Going back to the original question:

What is a use case for wanting to apply a validator after formatting the message?

If you have retrieved a URL, you can use the validator to assert that the URL schema has not changed. Where does the formatting come in?

rla commented on June 8, 2024

The use case is where I want to parse the URL only once, not in the validator and not in the later steps.

gajus commented on June 8, 2024

The use case is where I want to parse the URL only once, not in the validator and not in the later steps.

For performance purposes?

gajus commented on June 8, 2024

That said, you wouldn't be parsing the URL for validation purposes... in most cases a regex would be enough.

Unless, of course, your intention is to ignore new parameters being added to the URL or parameters changing order. That sounds dangerous, though.

gajus commented on June 8, 2024

A bit of case study.

I took an existing scraper (https://gist.github.com/gajus/68f9da3b27a51a58db990ae67e9acdae/8943410d8e39d1eb013b11ec0d5ae50471829c09) and attempted to rewrite it using a declarative API.

Note:

It is a rather complicated example. I have chosen it intentionally to discover the edge cases.

Let's start with:

scrapeVenues

export const scrapeVenues = async () => {
  const $ = await request('get', 'http://www.mk2.com/', 'html');

  return mapSelector($('#footer p:contains(Les salles MK2) + .item-list a[href^="/salles/"]'), (venue) => {
    let nid;

    venue.find('span').remove();

    nid = venue.attr('href');
    nid = extractMatch(/\/salles\/(.+)/, nid);

    const url = 'http://www.mk2.com/salles/' + nid;
    const name = extractTextFromElement(venue);

    if (!_.includes(nid, 'mk2-')) {
      throw new Error('Unexpected nid.');
    }

    return {
      guide: {
        url
      },
      result: {
        name,
        nid: nid.substr(4),
        url
      }
    };
  });
};

This is simple:

export const scrapeVenues = async () => {
  const document = await request('get', 'http://www.mk2.com/', 'html');

  const x = surgeon(document);

  const venues = x({
    properties: {
      name: {
        selector: '::text()'
      },
      nid: {
        match: '/salles/mk2-(.+)',
        selector: '::attribute(href)'
      },
      url: {
        selector: '::attribute(href)'
      }
    },
    selector: '#footer p:contains(Les salles MK2) + .item-list a[href^="/salles/"] {1,}'
  });

  return venues.map((venue) => {
    return {
      guide: {
        url: 'http://www.mk2.com' + venue.url
      },
      result: {
        name: venue.name,
        nid: venue.nid,
        url: 'http://www.mk2.com' + venue.url
      }
    }
  });
};

It can be made even more succinct if we:

  • allow declaring a "properties" property value as a plain string, assuming a single query
  • allow inlining inbuilt methods into the query

Example:

const venues = x({
  properties: {
    name: '::text()',
    nid: '::attribute(href)::match("/salles/mk2-(.+)")',
    url: '::attribute(href)'
  },
  selector: '#footer p:contains(Les salles MK2) + .item-list a[href^="/salles/"] {1,}'
});

That's succinct and unambiguous.

scrapeMovies

Next comes:

export const scrapeMovies = async (guide) => {
  const $ = await request('get', guide.url, 'html');

  return mapSelector($('#seances .l-mk2-tables .l-session-table .fiche-film-info .fiche-film-title'), (movieElement) => {
    return {
      guide: {
        movieElement
      },
      result: {
        name: extractTextFromElement(movieElement)
      }
    };
  });
};

The first problem is the selector:

#seances .l-mk2-tables .l-session-table .fiche-film-info .fiche-film-title

scrapeMovies selects the movie elements, then passes an instance of the resulting cheerio selector to scrapeShowtimes, which uses the parent selector tr to find the corresponding movie table row. Using the parent selector is bad because scrapeShowtimes should work only on the information it is provided (the identifier of an element, the element itself, etc.); it shouldn't be capable of iterating the DOM upwards.

We cannot fix this by changing the scrapeMovies selector to #seances .l-mk2-tables .l-session-table tr, because this would include rows that do not have the movie information. This can be solved using a has() selector. (I have raised proposal #8 to add a has() function.)

The next problem is that movieElement is an instance of a cheerio selector. This is bad because it makes it hard to log program inputs and outputs (guide.movieElement is being passed to scrapeShowtimes). Therefore, a simple solution is to create a selector that uniquely represents the element. (I have created a proposal for a selector() function.)

Which gives us something like this:

export const scrapeMovies = async (guide) => {
  const document = await request('get', guide.url, 'html');

  const x = surgeon(document);

  const movies = x({
    properties: {
      name: '.fiche-film-title',
      movieElementSelector: 'tr::selector()'
    },
    selector: '#seances .l-mk2-tables .l-session-table .fiche-film-info::has(.fiche-film-title) {0,}'
  });

  return movies.map((movie) => {
    return {
      guide: {
        url: movie.url,
        movieElementSelector: movie.movieElementSelector
      },
      result: {
        name: movie.name
      }
    }
  });
};

scrapeShowtimes

This takes us to scrapeShowtimes.

export const scrapeShowtimes = (guide) => {
  return mapSelector(guide.movieElement.parents('tr').find('.item-list a[href^="/reservation"]'), (timeElement) => {
    let date;
    let showtime;

    const text = extractTextFromElement(timeElement);
    const time = extractTime(text, 'HH[h]mm');
    const version = extractMatch(/(VOST|VO|VF)/, text);

    date = timeElement.parents('.l-session-table').find('.table-header .l-schedule-days').attr('id');
    date = extractDate(date, 'YYYYMMDD');

    showtime = {
      time: date + ' ' + time,
      url: 'http://www.mk2.com' + timeElement.attr('href')
    };

    showtime = _.assign(showtime, scrapeLanguageAttributes(version));

    return {
      result: showtime
    };
  });
};

Where do I start 🤦.

First, this reveals an error in scrapeMovies: the same movie appears multiple times in the document.

If you look at the target document, the structure is (pseudo markup) <date><movie /><movie /></date><date><movie /><movie /></date>.

That means that scrapeMovies needs to be modified to include a unique reference to the movie. I am going to use the movie URL for that.

const movies = x({
  properties: {
    name: '.fiche-film-title',
    movieUrl: 'a[href^="/films/"]::attribute(href)'
  },
  selector: '#seances .l-mk2-tables .l-session-table .fiche-film-info::has(.fiche-film-title) {0,}',
  uniqueBy: 'movieUrl'
});

I have added an ad-hoc helper uniqueBy (equivalent to _.uniqBy) to make the list of movies unique.
@todo Write a proposal.
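
For reference, uniqueBy (described above as equivalent to _.uniqBy) could be sketched as:

```javascript
// Keep only the first item for each distinct value of `property`.
const uniqueBy = (items, property) => {
  const seen = new Set();

  return items.filter((item) => {
    if (seen.has(item[property])) {
      return false;
    }

    seen.add(item[property]);

    return true;
  });
};

const movies = uniqueBy([
  { movieUrl: '/films/a', name: 'A' },
  { movieUrl: '/films/a', name: 'A' },
  { movieUrl: '/films/b', name: 'B' }
], 'movieUrl');
// movies contains two entries: /films/a and /films/b
```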

Now we need to iterate each date and find the movie.

const times = x({
  has: "a[href='${url}']",
  properties: {},
  selector: '#seances .l-mk2-tables .l-session-table'
}, {
  parameters: {
    url: '/films/fleur-tonnerre'
  }
});

I have added a parameters configuration. Parameters can be referred to using the ${parameter name} syntax. This allows us to filter elements using a dynamic condition.

@todo Write a proposal.
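
A minimal sketch of the ${parameter name} interpolation described above (the `interpolate` helper is hypothetical, and escaping rules are left out):

```javascript
// Replace every ${name} in the template with the corresponding parameter,
// throwing on unknown parameter names so typos fail fast.
const interpolate = (template, parameters) => {
  return template.replace(/\$\{([^}]+)\}/g, (match, name) => {
    if (!(name in parameters)) {
      throw new Error('Unknown parameter: ' + name);
    }

    return parameters[name];
  });
};

const selector = interpolate("a[href='${url}']", {
  url: '/films/fleur-tonnerre'
});
// selector === "a[href='/films/fleur-tonnerre']"
```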

Finally, we need to extract the data:

const x = surgeon(document, {
  parameters: {
    url: '/films/fleur-tonnerre'
  },
  formatters: {
    extractDate: (input, ...args) => {},
    extractTime: (input, ...args) => {}
  }
});

const times = x({
  has: "a[href='${url}']",
  properties: {
    date: '.table-header .l-schedule-days::attribute(id)::extractDate(YYYYMMDD)',
    times: {
      selector: '.item-list a[href^="/reservation"]',
      properties: {
        time: '::text()::extractTime(HH[h]mm)',
        version: '::text()::match("(VOST|VO|VF)")'
      }
    }
  },
  selector: '#seances .l-mk2-tables .l-session-table'
});

I am using formatters (helper functions) to format the result.

Which gives us:

export const scrapeShowtimes = async (guide) => {
  const document = await request('get', guide.url, 'html');

  const x = surgeon(document, {
    parameters: {
      url: guide.movieUrl
    },
    formatters: {
      extractDate,
      extractTime
    }
  });

  const dates = x({
    has: 'a[href="${url}"]',
    properties: {
      date: '.table-header .l-schedule-days::attribute(id)::extractDate(YYYYMMDD)',
      events: {
        selector: '.item-list a[href^="/reservation"]',
        properties: {
          url: '::attribute(href)',
          time: '::text()::extractTime(HH[h]mm)',
          version: '::text()::match("(VOST|VO|VF)")'
        }
      }
    },
    selector: '#seances .l-mk2-tables .l-session-table'
  });

  return _.flatten(dates.map((date) => {
    return date.events.map((event) => {
      return {
        time: date.date + ' ' + event.time,
        url: 'http://www.mk2.com' + event.url
      };
    });
  }));
};

The end result is this.

https://gist.github.com/gajus/68f9da3b27a51a58db990ae67e9acdae/ba0926962fe57ef68b060664a50a5c01d272b8b2

There is quite a large chunk of development involved in making this work. I'd really appreciate a review and suggestions for improvement.

gajus commented on June 8, 2024

@bitshadow please have a look at this too.

gajus commented on June 8, 2024

I have implemented a variation of the above API in the declarative-api branch.

{
  "adopt": {
    "articles": {
      "imageUrl": {
        "extract": {
          "name": "href",
          "type": "attribute"
        },
        "select": "img"
      },
      "summary": "p:first-child",
      "title": ".title"
    },
    "pageTitle": "h1"
  },
  "select": "main"
}

Or even shorter using action expressions.

{  
  "adopt": {
    "pageTitle": "::self",
    "articles": {
      "body": ".body @extract(attribute, innerHtml)",
      "imageUrl": "img @extract(attribute, src)",
      "summary": "p:first-child @extract(attribute, innerHtml)",
      "title": ".title"
    }
  },
  "pageTitle": "main > h1"
}

gajus commented on June 8, 2024

As I was saying, while working on the above API, I realised that a much more powerful API would allow performing subroutines, e.g.

{
  "articles": [
    {
      "action": "select",
      "selector": "article"
    },
    {
      "action": "adopt",
      "children": {
        "body": [
          {
            "action": "select",
            "selector": ".body"
          },
          {
            "action": "extract",
            "name": "innerHTML",
            "type": "property"
          }
        ],
        "imageUrl": [
          {
            "action": "select",
            "selector": "img"
          },
          {
            "action": "extract",
            "name": "src",
            "type": "attribute"
          }
        ],
        "summary": [
          {
            "action": "select",
            "selector": ".body p:first-child"
          },
          {
            "action": "extract",
            "type": "property",
            "name": "innerHTML"
          },
          {
            "action": "format",
            "name": "text"
          }
        ],
        "title": [
          {
            "action": "select",
            "selector": ".title"
          },
          {
            "action": "extract",
            "name": "textContent",
            "type": "property"
          }
        ]
      }
    }
  ],
  "pageName": [
    {
      "action": "select",
      "selector": ".body"
    },
    {
      "action": "extract",
      "name": "innerHTML",
      "type": "property"
    }
  ]
}

Because of the formatting, this looks huge. However, if we go back to using the DSL, it becomes manageable:

{
  "articles": [
    "select article",
    {
      "body": [
        "select .body",
        "extract property innerHTML"
      ],
      "imageUrl": [
        "select img",
        "extract attribute src"
      ],
      "summary": [
        "select .body p:first-child",
        "extract property innerHTML",
        "format text"
      ],
      "title": [
        "select .title",
        "extract property textContent"
      ]
    }
  ],
  "pageName": [
    "select .body",
    "extract property innerHTML"
  ]
}

If we use YAML, the entire thing is even simpler to read:

articles:
- select article
- body:
  - select .body
  - extract property innerHTML
  imageUrl:
  - select img
  - extract attribute src
  summary:
  - select .body p:first-child
  - extract property innerHTML
  - format text
  title:
  - select .title
  - extract property textContent
pageName:
- select .body
- extract property innerHTML

The benefit of the latter approach over the current implementation is that it enables combining arbitrary test and format functions, e.g.

pageName:
- select .body
- extract property innerHTML
- format extractFirstTextNode
- format extractTime
- test timeInFuture

I also think that it is easier to read and debug, because all commands are read (and executed) from top to bottom. Therefore, following the progress log is as simple as following the schema.
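
The top-to-bottom execution model can be sketched with a tiny interpreter over a fake element. The element shape, `makeElement`, and `run` are illustrative assumptions; a real implementation would delegate `select` to cheerio or the browser DOM:

```javascript
// Execute an array of command strings top-to-bottom, threading the current
// subject (element or value) through select/extract/format actions.
const formatters = {
  text: (html) => html.replace(/<[^>]*>/g, '').trim()
};

const makeElement = (spec) => spec;

const run = (element, commands) => {
  let subject = element;

  for (const command of commands) {
    const [action, ...args] = command.split(' ');

    if (action === 'select') {
      subject = subject.query[args.join(' ')];
    } else if (action === 'extract') {
      const [type, name] = args;

      subject = type === 'attribute' ?
        subject.attributes[name] :
        subject.properties[name];
    } else if (action === 'format') {
      subject = formatters[args[0]](subject);
    } else {
      throw new Error('Unknown action: ' + action);
    }
  }

  return subject;
};

const body = makeElement({
  query: {
    '.body p:first-child': makeElement({
      properties: { innerHTML: '<em>A summary.</em>' }
    })
  }
});

const summary = run(body, [
  'select .body p:first-child',
  'extract property innerHTML',
  'format text'
]);
// summary === 'A summary.'
```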

gajus commented on June 8, 2024

@rla @DaniGuardiola @licyeus what are your thoughts about this approach vs. the earlier one?

gajus commented on June 8, 2024

If we use YAML, the entire thing is even simpler to read:

articles:
- select article
- body:
  - select .body
  - extract property innerHTML
  imageUrl:
  - select img
  - extract attribute src
  summary:
  - select .body p:first-child
  - extract property innerHTML
  - format text
  title:
  - select .title
  - extract property textContent
pageName:
- select .body
- extract property innerHTML

This could be even further reduced using an (optional) pipe operator:

articles:
- select article
- body: select .body | extract property innerHTML
  imageUrl: select img | extract attribute src
  summary: select .body p:first-child | extract property innerHTML | format text
  title: select .title | extract property textContent
pageName: select .body | extract property innerHTML

DaniGuardiola commented on June 8, 2024

Edit: reformatted (email replies don't support markdown) and corrected some of the text (my English is not the best).

About the select and extract shorthand thing, I just came up with an idea:

This:

articles:
- select article
- body: select .body | extract property innerHTML
  imageUrl: select img | extract attribute src
  summary: select .body p:first-child | extract property innerHTML | format text
  title: select .title | extract property textContent
pageName: select .body | extract property innerHTML

Would become this:

articles:
- select article
- body: .body { property innerHTML }
  imageUrl: img { attribute src }
  summary: .body p:first-child { property innerHTML } | format text
  title: .title { property textContent }
pageName: .body { property innerHTML }

Also, a possible shorthand for getting textContent, which is extremely common:

articles: 
- select article
- title: .title {}

For attributes:

articles:
- select article
- imageUrl: img { [src] }

Some other ideas:

Use { propertyName } as a shorthand for properties when no action is declared, and [ attributeName ] for attributes.

Also, for innerHTML and textContent, writing "html" and "text" instead of the actual property name should make it clearer, as those are the two most common scraped properties.

In addition, it might even be a good idea to remove the need for piping right after using { } or [ ].

All of this, including the textContent shorthand, would result in:

articles:
- select article
- body: .body {html}
  imageUrl: img [src]
  summary: .body p:first-child {innerHTML} format text
  title: .title {} // or .title {text}
pageName: .body {html}

Of course, all of these shorthand expressions must be optional, but I think they would make a great addition. This would really simplify scraping; it makes it even beautiful, I dare say! (And that's a lot to say in the messy web-scraping world.)
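
As a rough sketch (not part of the proposal), this shorthand grammar could be parsed like so. The `parseShorthand` name and the action-object shapes are my assumptions, and attribute selectors such as a[href^=...] would need special handling:

```javascript
// `selector {name}` extracts a property, `selector [name]` an attribute,
// `{}` defaults to textContent, and html/text alias innerHTML/textContent.
const aliases = {
  html: 'innerHTML',
  text: 'textContent'
};

const parseShorthand = (expression) => {
  const match = expression.match(
    /^(.*?)\s*(\{\s*([^}]*?)\s*\}|\[\s*([^\]]+?)\s*\])$/
  );

  if (!match) {
    return [{ action: 'select', selector: expression }];
  }

  const [, selector, , property, attribute] = match;
  const steps = [{ action: 'select', selector }];

  if (attribute) {
    steps.push({ action: 'extract', name: attribute, type: 'attribute' });
  } else {
    const name = property || 'textContent';

    steps.push({
      action: 'extract',
      name: aliases[name] || name,
      type: 'property'
    });
  }

  return steps;
};

// parseShorthand('.body {html}') → select .body, extract property innerHTML
// parseShorthand('img [src]')    → select img, extract attribute src
// parseShorthand('.title {}')    → select .title, extract property textContent
```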

Thank you! :)

DaniGuardiola commented on June 8, 2024

I just saw a flaw in my proposal: selectors targeting attributes would conflict with the [ ] shorthand, but maybe something can be done about that.
