Comments (26)
DaniGuardiola commented 2 minutes ago
I just discovered this tool and also this alternative API and I'm quite hyped. If this becomes stable I'll be talking to my CTO to refactor our scrapers. I'm preparing the proposal already!
Keep it up 👍
Don't rush into it @DaniGuardiola. This API is still going to change. (I appreciate the enthusiasm, though.)
I am going to post an update in the next 15 minutes describing what's wrong with the above and how it can be improved.
from surgeon.
Also, thinking about performance, did you think about "stacking" cheerio (or whatever) calls somehow? Like, for example:
That's already being done.
To continue the example from the documentation:
x('body', {
  title: x('title'),
  articles: x('article {0,}', {
    body: x('[email protected]'),
    summary: x('.body p {0,}[0]'),
    imageUrl: x('img@src'),
    title: x('.title')
  })
});
A declarative approach would look something like this:
{
  "selector": "body",
  "properties": {
    "title": "title",
    "articles": {
      "selector": "article {0,}",
      "properties": {
        "body": "[email protected]",
        "summary": ".body p {0,}[0]",
        "imageUrl": "img@src",
        "title": ".title"
      }
    }
  }
}
The equivalent in YAML is even shorter:
selector: body
properties:
  title: title
  articles:
    selector: article {0,}
    properties:
      body: [email protected]
      summary: .body p {0,}[0]
      imageUrl: img@src
      title: .title
This also makes the syntax for declaring validators and formatters more intuitive, e.g.
selector: body
properties:
  title: title
  articles:
    selector: article {0,}
    properties:
      body: [email protected]
      summary: .body p {0,}[0]
      imageUrl: img@src
      title:
        selector: .title
        test: /foo/
        format: "upperCase"
Where test (a property of title) is a regular expression used to validate the result, and format (also a property of title) is used to format the result (as requested in #3).
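A minimal sketch of how such a declaration could be evaluated (the formatters registry and the applyRules helper are illustrative, not Surgeon's actual API):

```javascript
// Illustrative sketch: apply a `test` regular expression and a named
// `format` function to an extracted value. Not Surgeon's actual API.
const formatters = {
  upperCase: (value) => value.toUpperCase()
};

const applyRules = (value, rules) => {
  if (rules.test && !new RegExp(rules.test).test(value)) {
    // A validation failure breaks the scrape loudly.
    throw new Error('Validation failed for value: ' + value);
  }

  return rules.format ? formatters[rules.format](value) : value;
};

applyRules('foo bar', {test: 'foo', format: 'upperCase'}); // → 'FOO BAR'
```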
Minus the custom DSL used in the selector, this is becoming a lot like https://github.com/rla/dom-eee. (Might be a good thing.)
We could also add JSON schema support, e.g.
selector: .movie
schema: "movie"
properties:
  name: ".name"
  url: ".url"
Where schema: "movie" refers to a JSON schema loaded at the time of constructing a Surgeon instance.
I was looking for "declarative scraper" and found this: https://github.com/ContentMine/scraperJSON. It's not mature or anything, but it demonstrates an attempt to write a declarative scraper. There is even a Node.js implementation, https://github.com/ContentMine/thresher.
I like the idea of the regex capture groups:

regex - an Object specifying a regular expression whose groups should be captured as the results. The results will be an array of the captured groups. If the global flag (g) is specified, the result will be an array of arrays of captured groups. There are two keys allowed:
- source - a string specifying the regular expression to be executed. Required.
- flags - an array specifying the regex flags to be used (g, m, i, etc.). Optional (omitting this key will cause the regex to be executed with no flags).
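The quoted spec can be sketched in a few lines (applyRegex is a hypothetical helper, not part of scraperJSON or thresher):

```javascript
// Illustrative sketch of a scraperJSON-style regex spec ({source, flags})
// applied to extracted text. With the g flag, returns an array of arrays
// of captured groups; otherwise a single array of groups (or null).
const applyRegex = (spec, text) => {
  const flags = (spec.flags || []).join('');
  const regex = new RegExp(spec.source, flags);

  if (flags.includes('g')) {
    const results = [];
    let match;

    while ((match = regex.exec(text)) !== null) {
      results.push(match.slice(1));
    }

    return results;
  }

  const match = regex.exec(text);

  return match ? match.slice(1) : null;
};

applyRegex({source: '(\\d+)h(\\d+)'}, '19h30'); // → ['19', '30']
```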
There is also https://github.com/drbig/grabber
@rla any thoughts on this?
By limiting the schema to JSON, you do limit the transforms that can be done re: #3. Or would there be a way to define custom transforms and reference those by key (such that I could specify format: "myAwesomeTransform")?
By limiting the schema to JSON, you do limit the transforms that can be done re: #3. Or would there be a way to define custom transforms and reference those by key (such that I could specify format: "myAwesomeTransform")?
That's the idea.
I am convinced transforms should be added, though. Extracting data and formatting data are two very different tasks.
The focus of DOM-EEE was mainly on the extraction part, as the code involved there had a tendency to get too complex. The idea was to get simpler objects from the DOM by cutting down most of the element noise. The simplified object tree is supposed to be easier to work with using basic language constructs, such as loops or Array methods. This is what I experienced in multiple projects. The second goal was extreme portability. This made JSON input-output mandatory. If the project is mainly to be used from JavaScript environments, then this might not be the optimal choice.
Validation and transformation can also be represented in the declarative form as string identifiers or as arrays of them, like
{
  selector: '.date',
  transform: 'convert-date',
  validity: 'date-not-in-future'
}
The actual transforms and validators need to be defined and registered with the library first. Non-existing transforms and validators can then be easily checked for. I see that defining them inline directly on the declarative form can make it too complex. The order in which to apply validations and transforms is not clear, though. We might want to check if the selector matches at all, or actually validate the transformed date.
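This registration idea can be sketched roughly (the registry functions below are illustrative, not an actual library API):

```javascript
// Illustrative sketch: transforms registered by name, so that unknown
// names in a declarative schema can be caught upfront.
const transforms = {};

const registerTransform = (name, fn) => {
  transforms[name] = fn;
};

const resolveTransform = (name) => {
  if (!transforms[name]) {
    // Non-existing transforms are easy to check for.
    throw new Error('Unknown transform: ' + name);
  }

  return transforms[name];
};

registerTransform('convert-date', (value) => new Date(value));

resolveTransform('convert-date'); // resolves
// resolveTransform('not-registered'); would throw
```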
One of the DOM-EEE design aspects is that non-matching selectors return null, making catch-all validation easy. The output just has to be checked for nulls. This with some amounts of "manually executed" checks has proven to cover lots of cases for me.
JSON schema can be applied to the output independently of this library. If we built in support, we would have to choose an implementation. As I understand it, there are drafts 3 and 4 of JSON Schema, with huge differences, and various packages pick arbitrarily what to support from either of them.
JSON schema can be applied to the output independently of this library. If we built in support, we would have to choose an implementation. As I understand it, there are drafts 3 and 4 of JSON Schema, with huge differences, and various packages pick arbitrarily what to support from either of them.
In terms of choosing an implementation, Ajv (https://github.com/epoberezkin/ajv) is now a somewhat de facto standard in the JavaScript community.
However, I agree with your point that it can be done outside of Surgeon.
One of the DOM-EEE design aspects is that non-matching selectors return null, making catch-all validation easy. The output just has to be checked for nulls. This with some amounts of "manually executed" checks has proven to cover lots of cases for me.
What about throwing an error (like Surgeon does at the present time)?
This way you are sure that no unexpected behaviour goes unseen.
The actual transforms and validators need to be defined and registered with the library first. Non-existing transforms and validators can then be easily checked for. I see that defining them in-line directly on the declarative form can make it too complex.
Agree.
The order in which to apply validations and transforms is not clear, though. We might want to check if the selector matches at all, or actually validate the transformed date.
What is a use case for wanting to apply a validator after formatting the message?
Wouldn't a formatter throw an error if it cannot format the data to the desired format?
What about throwing an error (like Surgeon does at the present time)?
This needs some mechanism to mark optional properties, for the case when an element sometimes exists and sometimes doesn't but you would like to use it when it does.
What is a use case for wanting to apply a validator after formatting the message?
To decouple date parsing from checking whether the date is in the future, for example. Composability; otherwise you need a single parseDateButAlsoCheckItIsInFuture transform. I can see that some sort of pipeline could be defined, maybe even represented similarly to shell pipes, like apply: 'parseDate|checkDateInFuture', where validation is just an identity transform that throws an error when the condition does not hold.
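A rough sketch of such a pipeline, with validators as identity transforms that throw (the handler names and the apply helper are illustrative):

```javascript
// Illustrative pipe pipeline: each name maps to a function; validators
// return their input unchanged but throw when the condition fails.
const handlers = {
  trim: (value) => value.trim(),
  toUpperCase: (value) => value.toUpperCase(),
  nonEmpty: (value) => {
    if (value.length === 0) {
      throw new Error('Validation failed: nonEmpty');
    }

    // Identity: pass the value through unchanged.
    return value;
  }
};

const apply = (expression, value) => {
  return expression.split('|').reduce((result, name) => handlers[name](result), value);
};

apply('trim|nonEmpty|toUpperCase', '  foo  '); // → 'FOO'
```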
Wouldn't you agree that checkDateInFuture is a filter feature rather than validation?
The fact that a document contains a date that's in the past does not make it an invalid date. It's just a data set that you are not interested in. "validate" here would do no good, since it would throw an error (and break the scraper). A filter could be used, though.
Wouldn't you agree that checkDateInFuture is a filter feature rather than validation?
A better example is parsing a URL and checking if it contains a specific query parameter. The validation here means guarding against a changed URL structure, not filtering a set of URLs on the page. Filtering probably needs to be described as well, maybe as a separate issue.
Going back to the original question:
What is a use case for wanting to apply a validator after formatting the message?
If you have retrieved a URL, you can use the validator to assert that the URL schema has not changed. Where does the formatting come in?
The use case is where I want to parse the URL only once, not in the validator and not in the later steps.
The use case is where I want to parse the URL only once, not in the validator and not in the later steps.
For performance purposes?
That said, you wouldn't be parsing the URL for validation purposes... in most cases a regex would be enough.
Unless, of course, your intention is to ignore new parameters being added to the URL or parameters changing order. This sounds dangerous, though.
A bit of case study.
I took an existing scraper (https://gist.github.com/gajus/68f9da3b27a51a58db990ae67e9acdae/8943410d8e39d1eb013b11ec0d5ae50471829c09) and attempted to rewrite it using a declarative API.
Note:
It is a rather complicated example. I have chosen it intentionally to discover the edge cases.
Let's start with:
scrapeVenues
export const scrapeVenues = async () => {
  const $ = await request('get', 'http://www.mk2.com/', 'html');

  return mapSelector($('#footer p:contains(Les salles MK2) + .item-list a[href^="/salles/"]'), (venue) => {
    let nid;

    venue.find('span').remove();

    nid = venue.attr('href');
    nid = extractMatch(/\/salles\/(.+)/, nid);

    const url = 'http://www.mk2.com/salles/' + nid;
    const name = extractTextFromElement(venue);

    if (!_.includes(nid, 'mk2-')) {
      throw new Error('Unexpected nid.');
    }

    return {
      guide: {
        url
      },
      result: {
        name,
        nid: nid.substr(4),
        url
      }
    };
  });
};
This is simple:
export const scrapeVenues = async () => {
  const document = await request('get', 'http://www.mk2.com/', 'html');

  const x = surgeon(document);

  const venues = x({
    properties: {
      name: {
        selector: '::text()'
      },
      nid: {
        match: '/salles/mk2-(.+)',
        selector: '::attribute(href)'
      },
      url: {
        selector: '::attribute(href)'
      }
    },
    selector: '#footer p:contains(Les salles MK2) + .item-list a[href^="/salles/"] {1,}'
  });

  return venues.map((venue) => {
    return {
      guide: {
        url: 'http://www.mk2.com/salles/mk2-' + venue.url
      },
      result: {
        name: venue.name,
        nid: venue.nid,
        url: 'http://www.mk2.com/salles/mk2-' + venue.url
      }
    };
  });
};
It can be made even more succinct if we:
- allow declaring the "properties" property value as a string, assuming a single query
- allow inlining inbuilt methods into the query
Example:
const venues = x({
  properties: {
    name: '::text()',
    nid: '::attribute(href)::match("/salles/mk2-(.+)")',
    url: '::attribute(href)'
  },
  selector: '#footer p:contains(Les salles MK2) + .item-list a[href^="/salles/"] {1,}'
});
That's succinct and unambiguous.
scrapeMovies
Next comes:
export const scrapeMovies = async (guide) => {
  const $ = await request('get', guide.url, 'html');

  return mapSelector($('#seances .l-mk2-tables .l-session-table .fiche-film-info .fiche-film-title'), (movieElement) => {
    return {
      guide: {
        movieElement
      },
      result: {
        name: extractTextFromElement(movieElement)
      }
    };
  });
};
The first problem is the selector:

#seances .l-mk2-tables .l-session-table .fiche-film-info .fiche-film-title

scrapeMovies selects the movie elements, then passes an instance of the resulting cheerio selector to scrapeShowtimes, which then uses the parent selector tr to find the corresponding movie table row. Using the parent selector is bad because scrapeShowtimes should work only on the information it is provided (the identifier of an element, the element, etc.); it shouldn't be capable of iterating the DOM upwards.

We cannot fix this by changing the scrapeMovies selector to #seances .l-mk2-tables .l-session-table tr, because this would include rows that do not have the movie information. This can be solved using a parent selector. (I have raised proposal #8 to add a has() function.)
The next problem is that movieElement is an instance of a cheerio selector. This is bad because it makes it hard to log program inputs and outputs (guide.movieElement is being passed to scrapeShowtimes). Therefore, a simple solution is to create a selector that uniquely represents the element. (I have created a proposal for a selector() function.)
Which gives us something like this:
export const scrapeMovies = async (guide) => {
  const document = await request('get', guide.url, 'html');

  const x = surgeon(document);

  const movies = x({
    properties: {
      name: '.fiche-film-title',
      movieElementSelector: 'tr::selector()'
    },
    selector: '#seances .l-mk2-tables .l-session-table .fiche-film-info::has(.fiche-film-title) {0,}'
  });

  return movies.map((movie) => {
    return {
      guide: {
        url: movie.url,
        movieElementSelector: movie.movieElementSelector
      },
      result: {
        name: movie.name
      }
    };
  });
};
scrapeShowtimes
This takes us to scrapeShowtimes.
export const scrapeShowtimes = (guide) => {
  return mapSelector(guide.movieElement.parents('tr').find('.item-list a[href^="/reservation"]'), (timeElement) => {
    let date;
    let showtime;

    const text = extractTextFromElement(timeElement);
    const time = extractTime(text, 'HH[h]mm');
    const version = extractMatch(/(VOST|VO|VF)/, text);

    date = timeElement.parents('.l-session-table').find('.table-header .l-schedule-days').attr('id');
    date = extractDate(date, 'YYYYMMDD');

    showtime = {
      time: date + ' ' + time,
      url: 'http://www.mk2.com' + timeElement.attr('href')
    };

    showtime = _.assign(showtime, scrapeLanguageAttributes(version));

    return {
      result: showtime
    };
  });
};
Where do I start 🤦.
First, this reveals an error in scrapeMovies: the same movie appears multiple times in the document. If you look at the target document, the structure is (pseudo markup) <date><movie /><movie /></date><date><movie /><movie /></date>.

That means that scrapeMovies needs to be modified to include a unique reference to the movie. I am going to use the movie URL for that.
const movies = x({
  properties: {
    name: '.fiche-film-title',
    movieUrl: 'a[href^="/films/"]::attribute(href)'
  },
  selector: '#seances .l-mk2-tables .l-session-table .fiche-film-info::has(.fiche-film-title) {0,}',
  uniqueBy: 'url'
});
I have added an ad-hoc helper, uniqueBy (equivalent to _.uniqBy), to make the list of movies unique.
@todo Write a proposal.
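A possible sketch of the uniqueBy helper, mirroring _.uniqBy (the implementation is illustrative):

```javascript
// Illustrative uniqueBy: keeps the first item seen for each value of the
// given key, like lodash's _.uniqBy(items, key).
const uniqueBy = (items, key) => {
  const seen = new Set();

  return items.filter((item) => {
    if (seen.has(item[key])) {
      return false;
    }

    seen.add(item[key]);

    return true;
  });
};

uniqueBy([{movieUrl: '/films/a'}, {movieUrl: '/films/a'}, {movieUrl: '/films/b'}], 'movieUrl');
// → [{movieUrl: '/films/a'}, {movieUrl: '/films/b'}]
```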
Now we need to iterate each date and find the movie.
const times = x({
  has: "a[href='${url}']",
  properties: {},
  selector: '#seances .l-mk2-tables .l-session-table'
}, {
  parameters: {
    url: '/films/fleur-tonnerre'
  }
});
I have added a parameters configuration. Parameters can be referred to using ${parameter name} syntax. This allows us to filter elements using a dynamic condition.
@todo Write a proposal.
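A sketch of how the ${...} interpolation could work (interpolate is a hypothetical helper, not Surgeon's implementation):

```javascript
// Illustrative ${name} interpolation for selector expressions; unknown
// parameter names fail loudly instead of producing a broken selector.
const interpolate = (template, parameters) => {
  return template.replace(/\$\{([^}]+)\}/g, (match, name) => {
    if (!(name in parameters)) {
      throw new Error('Unknown parameter: ' + name);
    }

    return parameters[name];
  });
};

interpolate("a[href='${url}']", {url: '/films/fleur-tonnerre'});
// → "a[href='/films/fleur-tonnerre']"
```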
Finally, we need to extract the data:
const x = surgeon(document, {
  parameters: {
    url: '/films/fleur-tonnerre'
  },
  formatters: {
    extractDate: (input, ...args) => {},
    extractTime: (input, ...args) => {}
  }
});

const times = x({
  has: "a[href='${url}']",
  properties: {
    date: '.table-header .l-schedule-days::attribute(id)::extractDate(YYYYMMDD)',
    times: {
      selector: '.item-list a[href^="/reservation"]',
      properties: {
        time: '::text()::extractTime(HH[h]mm)',
        version: '::text()::match("(VOST|VO|VF)")'
      }
    }
  },
  selector: '#seances .l-mk2-tables .l-session-table'
});
I am using formatters (helper functions) to format the result.
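Possible stand-ins for these formatters (the implementations below are illustrative; the real helpers are separate utilities in the scraper):

```javascript
// Illustrative extractDate/extractTime formatters for this scraper.
const extractDate = (input) => {
  // e.g. an id containing '20170125' (YYYYMMDD) → '2017-01-25'
  const match = /(\d{4})(\d{2})(\d{2})/.exec(input);

  return match ? match[1] + '-' + match[2] + '-' + match[3] : null;
};

const extractTime = (input) => {
  // e.g. '19h30 VOST' (HH[h]mm) → '19:30'
  const match = /(\d{1,2})h(\d{2})/.exec(input);

  return match ? match[1].padStart(2, '0') + ':' + match[2] : null;
};

extractTime('19h30 VOST'); // → '19:30'
```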
Which gives us:
export const scrapeShowtimes = async (guide) => {
  const document = await request('get', guide.url, 'html');

  const x = surgeon(document, {
    parameters: {
      url: guide.movieUrl
    },
    formatters: {
      extractDate,
      extractTime
    }
  });

  const dates = x({
    has: 'a[href="${url}"]',
    properties: {
      date: '.table-header .l-schedule-days::attribute(id)::extractDate(YYYYMMDD)',
      events: {
        selector: '.item-list a[href^="/reservation"]',
        properties: {
          url: '::attribute(href)',
          time: '::text()::extractTime(HH[h]mm)',
          version: '::text()::match("(VOST|VO|VF)")'
        }
      }
    },
    selector: '#seances .l-mk2-tables .l-session-table'
  });

  return _.flatten(dates.map((date) => {
    return date.events.map((event) => {
      return {
        time: date.date + ' ' + event.time,
        url: 'http://www.mk2.com' + event.url
      };
    });
  }));
};
The end result is this.
There is quite a large chunk of development involved to make this work. I'd really appreciate a review and suggestions for improvement.
@bitshadow please have a look at this too.
I have implemented a variation of the above API in the declarative-api branch.
{
  "adopt": {
    "articles": {
      "imageUrl": {
        "extract": {
          "name": "href",
          "type": "attribute"
        },
        "select": "img"
      },
      "summary": "p:first-child",
      "title": ".title"
    },
    "pageTitle": "h1"
  },
  "select": "main"
}
Or even shorter using action expressions.
{
  "adopt": {
    "pageTitle": "::self",
    "articles": {
      "body": ".body @extract(attribute, innerHtml)",
      "imageUrl": "img @extract(attribute, src)",
      "summary": "p:first-child @extract(attribute, innerHtml)",
      "title": ".title"
    }
  },
  "pageTitle": "main > h1"
}
As I was saying, while working on the above API I realised that a much more powerful API would allow performing subroutines, e.g.
{
  "articles": [
    {
      "action": "select",
      "selector": "article"
    },
    {
      "action": "adopt",
      "children": {
        "body": [
          {
            "action": "select",
            "selector": ".body"
          },
          {
            "action": "extract",
            "name": "innerHTML",
            "type": "property"
          }
        ],
        "imageUrl": [
          {
            "action": "select",
            "selector": "img"
          },
          {
            "action": "extract",
            "name": "src",
            "type": "attribute"
          }
        ],
        "summary": [
          {
            "action": "select",
            "selector": ".body p:first-child"
          },
          {
            "action": "extract",
            "type": "property",
            "name": "innerHTML"
          },
          {
            "action": "format",
            "name": "text"
          }
        ],
        "title": [
          {
            "action": "select",
            "selector": ".title"
          },
          {
            "action": "extract",
            "name": "textContent",
            "type": "property"
          }
        ]
      }
    }
  ],
  "pageName": [
    {
      "action": "select",
      "selector": ".body"
    },
    {
      "action": "extract",
      "name": "innerHTML",
      "type": "property"
    }
  ]
}
Because of the formatting, this looks huge. However, if we go back to using DSL, it becomes manageable:
{
  "articles": [
    "select article",
    {
      "body": [
        "select .body",
        "extract property innerHTML"
      ],
      "imageUrl": [
        "select img",
        "extract attribute src"
      ],
      "summary": [
        "select .body p:first-child",
        "extract property innerHTML",
        "format text"
      ],
      "title": [
        "select .title",
        "extract property textContent"
      ]
    }
  ],
  "pageName": [
    "select .body",
    "extract property innerHTML"
  ]
}
If we use YAML, the entire thing is even simpler to read:
articles:
  - select article
  - body:
      - select .body
      - extract property innerHTML
    imageUrl:
      - select img
      - extract attribute src
    summary:
      - select .body p:first-child
      - extract property innerHTML
      - format text
    title:
      - select .title
      - extract property textContent
pageName:
  - select .body
  - extract property innerHTML
The benefit of the latter approach over the current implementation is that it enables the combination of arbitrary test and format functions, e.g.
pageName:
  - select .body
  - extract property innerHTML
  - format extractFirstTextNode
  - format extractTime
  - test timeInFuture
I also think that it is easier to read and debug, because all commands are read (and executed) from top-to-bottom. Therefore, following the progress log is as simple as following the schema.
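Reading the string commands back into the verbose action objects is mechanical; a sketch (parseCommand is illustrative, not the branch's actual parser):

```javascript
// Illustrative parser for the string DSL:
// 'select .title' → {action: 'select', selector: '.title'}
// 'extract property textContent' → {action: 'extract', type: 'property', name: 'textContent'}
const parseCommand = (command) => {
  const [action, ...args] = command.split(' ');

  if (action === 'select') {
    // Selectors may contain spaces, so rejoin the remainder.
    return {action: 'select', selector: args.join(' ')};
  }

  if (action === 'extract') {
    return {action: 'extract', type: args[0], name: args[1]};
  }

  if (action === 'format') {
    return {action: 'format', name: args[0]};
  }

  throw new Error('Unknown action: ' + action);
};

parseCommand('select .body p:first-child');
// → {action: 'select', selector: '.body p:first-child'}
```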
@rla @DaniGuardiola @licyeus what are your thoughts about this approach vs the earlier?
If we use YAML, the entire thing is even simpler to read:

articles:
  - select article
  - body:
      - select .body
      - extract property innerHTML
    imageUrl:
      - select img
      - extract attribute src
    summary:
      - select .body p:first-child
      - extract property innerHTML
      - format text
    title:
      - select .title
      - extract property textContent
pageName:
  - select .body
  - extract property innerHTML
This could be even further reduced using an (optional) pipe operator:
articles:
  - select article
  - body: select .body | extract property innerHTML
    imageUrl: select img | extract attribute src
    summary: select .body p:first-child | extract property innerHTML | format text
    title: select .title | extract property textContent
pageName: select .body | extract property innerHTML
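The pipe shorthand can be treated as sugar for the array form; a sketch of a normalizer that expands it before interpretation (expandPipes is illustrative):

```javascript
// Illustrative normalizer: a piped string becomes the equivalent array of
// commands; anything that is already an array passes through unchanged.
const expandPipes = (value) => {
  if (typeof value === 'string') {
    return value.split('|').map((command) => command.trim());
  }

  return value;
};

expandPipes('select .body | extract property innerHTML');
// → ['select .body', 'extract property innerHTML']
```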
Edit: reformatted (email replies don't support markdown) and corrected some of the text (my English is not the best).
About the select and extract shorthand thing, I just came up with an idea:
This:
articles:
  - select article
  - body: select .body | extract property innerHTML
    imageUrl: select img | extract attribute src
    summary: select .body p:first-child | extract property innerHTML | format text
    title: select .title | extract property textContent
pageName: select .body | extract property innerHTML
Would become this:
articles:
  - select article
  - body: .body { property innerHTML }
    imageUrl: img { attribute src }
    summary: .body p:first-child { property innerHTML } | format text
    title: .title { property textContent }
pageName: .body { property innerHTML }
Also, a possible shorthand for getting textContent, which is extremely common:
articles:
  - select article
  - title: .title {}
For attributes:
articles:
  - select article
  - imageUrl: img { [src] }
Some other ideas:
- Use { propertyName } as a shorthand for properties when no action is declared, and [ attributeName ] for attributes.
- For innerHTML and textContent, writing "html" and "text" instead of the actual property name should make it clearer, as those are the two most common scraped properties.
- In addition, it might even be a good idea to remove the need for piping right after using { } or [ ].
All of this, including the textContent shorthand, would result in:
articles:
  - select article
  - body: .body {html}
    imageUrl: img [src]
    summary: .body p:first-child {innerHTML} format text
    title: .title {} # or .title {text}
pageName: .body {html}
Of course, all of these shorthand expressions must be optional, but I think it would make an awesomely great addition. This would really simplify scraping; it makes it even beautiful, I dare say! (And that's a lot to say in the messy web-scraping world.)
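A rough sketch of how the proposed shorthand could be expanded into explicit commands (expandShorthand and the alias table are illustrative; selectors that themselves end in [ ] are not handled):

```javascript
// Illustrative expansion of the proposed shorthand:
// '.body {html}' → select + extract property innerHTML
// 'img [src]'    → select + extract attribute src
// '.title {}'    → select + extract property textContent (the default)
const propertyAliases = {html: 'innerHTML', text: 'textContent'};

const expandShorthand = (expression) => {
  const match = /^(.*?)\s*(?:\{\s*(\w*)\s*\}|\[\s*(\w+)\s*\])$/.exec(expression);

  if (!match) {
    return ['select ' + expression];
  }

  const commands = ['select ' + match[1].trim()];

  if (match[3]) {
    commands.push('extract attribute ' + match[3]);
  } else {
    // An empty {} defaults to textContent, per the proposal.
    const property = propertyAliases[match[2]] || match[2] || 'textContent';

    commands.push('extract property ' + property);
  }

  return commands;
};

expandShorthand('img [src]'); // → ['select img', 'extract attribute src']
```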
Thank you! :)
I just saw a flaw in my proposal: selectors targeting attributes would conflict with the [ ] shorthand, but maybe something can be done about that.