Surgeon

Declarative DOM extraction expression evaluator.

Powerful, succinct, composable, extendable, declarative API.

articles:
- select article {0,}
- body:
  - select .body
  - read property innerHTML
  imageUrl:
  - select img
  - read attribute src
  summary:
  - select ".body p:first-child"
  - read property innerHTML
  - format text
  title:
  - select .title
  - read property textContent
pageName:
- select .body
- read property innerHTML

Not succinct enough for you? Use aliases and the pipe operator (|) to shorten and concatenate the commands:

articles:
- sm article
- body: s .body | rp innerHTML
  imageUrl: s img | ra src
  summary: s .body p:first-child | rp innerHTML | f text
  title: s .title | rp textContent
pageName: s .body | rp innerHTML

Have you got suggestions for improvement? I am all ears.


Configuration

| Name | Type | Description | Default value |
| --- | --- | --- | --- |
| evaluator | EvaluatorType | HTML parser and selector engine. See evaluators. | browser evaluator if window and document variables are present, cheerio otherwise. |
| subroutines | $PropertyType<UserConfigurationType, 'subroutines'> | User defined subroutines. See subroutines. | N/A |

Evaluators

Subroutines use an evaluator to parse input (i.e. convert a string into an object) and to select nodes in the resulting document.

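The default evaluator is configured based on the user environment: the browser evaluator is used if the window and document variables are present, and the cheerio evaluator is used otherwise.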

Have a use case for another evaluator? Raise an issue.

For example implementations of an evaluator, refer to the browser and cheerio evaluators described below.

browser evaluator

Uses native browser methods to parse the document and to evaluate CSS selector queries.

Use browser evaluator if you are running Surgeon in a browser or a headless browser (e.g. PhantomJS).

// Assumes the evaluator is exported from the 'surgeon' package entry point.
import surgeon, {
  browserEvaluator
} from 'surgeon';

surgeon({
  evaluator: browserEvaluator()
});

cheerio evaluator

Uses cheerio to parse the document and to evaluate CSS selector queries.

Use cheerio evaluator if you are running Surgeon in Node.js.

// Assumes the evaluator is exported from the 'surgeon' package entry point.
import surgeon, {
  cheerioEvaluator
} from 'surgeon';

surgeon({
  evaluator: cheerioEvaluator()
});

Subroutines

A subroutine is a function used to advance the DOM extraction expression evaluator, e.g.

x('foo | bar baz', 'qux');

In the above example, the Surgeon expression uses two subroutines: foo and bar.

The foo subroutine is invoked without additional values. The bar subroutine is invoked with one value ("baz").

Subroutines are executed in the order in which they are defined: the result of each subroutine is passed on to the next one. The first subroutine receives the document input (in this case, the "qux" string).

Multiple subroutines can be written as an array. The following example is equivalent to the earlier example.

x([
  'foo',
  'bar baz'
], 'qux');
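
The execution order described above can be illustrated with a minimal sketch. It is not Surgeon's implementation: the `upper` and `suffix` subroutines and the `run` helper are hypothetical stand-ins, and parameters are split naively on spaces.

```javascript
// Hypothetical stand-ins for subroutines; these are NOT Surgeon built-ins.
const subroutines = {
  upper: (subject) => subject.toUpperCase(),
  suffix: (subject, [tail]) => subject + tail,
};

// Runs each instruction in order, feeding every result into the next subroutine.
const run = (instructions, input) => {
  return instructions.reduce((subject, instruction) => {
    const [name, ...values] = instruction.split(' ');

    if (!subroutines[name]) {
      throw new Error('Unknown subroutine: ' + name);
    }

    return subroutines[name](subject, values);
  }, input);
};

// The first subroutine receives the input; the second receives the first result.
const result = run(['upper', 'suffix !'], 'qux');

console.log(result); // 'QUX!'
```

The reduce call makes the data flow explicit: the accumulator is always the subject value handed to the next subroutine.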

There are two types of subroutines: built-in subroutines and user-defined subroutines.

Note:

These functions are called subroutines to emphasise the cross-platform nature of the declarative API.

Built-in subroutines

The following subroutines are available out of the box.

append subroutine

append appends a string to the input string.

| Parameter name | Description | Default |
| --- | --- | --- |
| tail | Appends a string to the end of the input string. | N/A |

Examples:

// Assuming an element <a href='http://foo' />,
// then the result is 'http://foo/bar'.
x(`select a | read attribute href | append '/bar'`);

closest subroutine

closest subroutine iterates through all the preceding nodes (including parent nodes) searching for either a preceding node matching the selector expression or a descendant of the preceding node matching the selector.

Note: This is different from the jQuery .closest() in that the latter method does not search for parent descendants matching the selector.

| Parameter name | Description | Default |
| --- | --- | --- |
| CSS selector | CSS selector used to select an element. | N/A |

constant subroutine

constant returns the parameter value regardless of the input.

| Parameter name | Description | Default |
| --- | --- | --- |
| constant | Constant value that will be returned as the result. | N/A |

format subroutine

format is used to format the input using sprintf.

| Parameter name | Description | Default |
| --- | --- | --- |
| format | sprintf format used to format the input string. The subroutine input is the first argument, i.e. %1$s. | %1$s |

Examples:

// Reads the href attribute of the matching element.
// Prefixes the value with 'http://foo.com'.
x(`select a | read attribute href | format 'http://foo.com%1$s'`);
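
To make the %1$s placeholder concrete, here is a minimal sketch of positional substitution. `formatPositional` is a hypothetical helper that supports only %N$s placeholders; it is not the sprintf implementation that Surgeon uses.

```javascript
// Minimal positional formatter: supports only %N$s placeholders (1-based).
const formatPositional = (format, ...args) => {
  return format.replace(/%(\d+)\$s/g, (placeholder, position) => {
    const index = Number(position) - 1;

    if (index < 0 || index >= args.length) {
      throw new Error('Missing argument for ' + placeholder);
    }

    return String(args[index]);
  });
};

// The subroutine input plays the role of the first argument (%1$s).
const url = formatPositional('http://foo.com%1$s', '/bar');

console.log(url); // 'http://foo.com/bar'
```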

match subroutine

match is used to extract matching capturing groups from the subject input.

| Parameter name | Description | Default |
| --- | --- | --- |
| Regular expression | Regular expression used to match capturing groups in the string. | N/A |
| Sprintf format | sprintf format used to construct a string using the matching capturing groups. | %s |

Examples:

// Extracts 1 matching capturing group from the input string.
// Throws `InvalidDataError` if the value does not pass the test.
// Note: the backslash is doubled because the expression is a JavaScript string literal.
x('select .foo | read property textContent | match "/input: (\\d+)/"');

// Extracts 2 matching capturing groups from the input string and formats the output using sprintf.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | match "/input: (\\d+)-(\\d+)/" %2$s-%1$s');
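
The combination of capturing groups and a sprintf format can be sketched as follows. `matchAndFormat` is a hypothetical helper written for illustration; it handles only %N$s placeholders and throws a plain Error rather than Surgeon's InvalidDataError.

```javascript
// Hypothetical helper: extract capturing groups, then build a string from them.
const matchAndFormat = (pattern, subject, format = '%1$s') => {
  const groups = pattern.exec(subject);

  if (groups === null) {
    throw new Error('Input does not match ' + pattern);
  }

  // groups[0] is the whole match; capturing groups start at index 1.
  return format.replace(/%(\d+)\$s/g, (placeholder, position) => {
    return groups[Number(position)];
  });
};

// Swaps the two captured numbers.
const swapped = matchAndFormat(/input: (\d+)-(\d+)/, 'input: 12-34', '%2$s-%1$s');

console.log(swapped); // '34-12'
```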

nextUntil subroutine

nextUntil subroutine is used to select all following siblings of each element up to but not including the element matched by the selector.

| Parameter name | Description | Default |
| --- | --- | --- |
| selector expression | A string containing a selector expression to indicate where to stop matching following sibling elements. | N/A |
| filter expression | A string containing a selector expression to match elements against. | |
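
The "up to but not including" behaviour can be sketched on a plain array of siblings. This is an illustration only: elements are modelled as plain objects, and the predicate functions stand in for the selector and filter expressions.

```javascript
// Collects the siblings that follow `startIndex` until `isStop` matches.
// `isIncluded` plays the role of the optional filter expression.
const nextUntil = (siblings, startIndex, isStop, isIncluded = () => true) => {
  const result = [];

  for (let i = startIndex + 1; i < siblings.length; i++) {
    if (isStop(siblings[i])) {
      break;
    }

    if (isIncluded(siblings[i])) {
      result.push(siblings[i]);
    }
  }

  return result;
};

const siblings = [
  {tag: 'h2', text: 'A'},
  {tag: 'p', text: 'one'},
  {tag: 'ul', text: 'two'},
  {tag: 'h2', text: 'B'},
  {tag: 'p', text: 'three'},
];

// Starting from the first h2, stop at the next h2 and keep only p elements.
const section = nextUntil(siblings, 0, (el) => el.tag === 'h2', (el) => el.tag === 'p');

console.log(section.map((el) => el.text)); // ['one']
```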

prepend subroutine

prepend prepends a string to the input string.

| Parameter name | Description | Default |
| --- | --- | --- |
| head | Prepends a string to the start of the input string. | N/A |

Examples:

// Assuming an element <a href='//foo' />,
// then the result is 'http://foo'.
x(`select a | read attribute href | prepend 'http:'`);

previous subroutine

previous subroutine selects the preceding sibling.

| Parameter name | Description | Default |
| --- | --- | --- |
| CSS selector | CSS selector used to select an element. | N/A |

Example:

<ul>
  <li>foo</li>
  <li class='bar'></li>
</ul>

x('select .bar | previous | read property textContent');
// 'foo'

read subroutine

read is used to extract value from the matching element using an evaluator.

| Parameter name | Description | Default |
| --- | --- | --- |
| Target type | Possible values: "attribute" or "property" | N/A |
| Target name | Depending on the target type, name of an attribute or a property. | N/A |

Examples:

// Returns .foo element "href" attribute value.
// Throws error if attribute does not exist.
x('select .foo | read attribute href');

// Returns an array of "href" attribute values of the matching elements.
// Throws error if attribute does not exist on either of the matching elements.
x('select .foo {0,} | read attribute href');

// Returns .foo element "textContent" property value.
// Throws error if property does not exist.
x('select .foo | read property textContent');

remove subroutine

remove subroutine is used to remove elements from the document using an evaluator.

remove subroutine accepts the same parameters as the select subroutine.

The result of remove subroutine is the input of the subroutine, i.e. previous select subroutine result.

| Parameter name | Description | Default |
| --- | --- | --- |
| CSS selector | CSS selector used to select an element. | N/A |
| Quantifier expression | A quantifier expression is used to control the expected result length. | See quantifier expression. |

Examples:

// Returns 'bar'.
x('select .foo | remove span | read property textContent', `<div class='foo'>bar<span>baz</span></div>`);

select subroutine

select subroutine is used to select the elements in the document using an evaluator.

| Parameter name | Description | Default |
| --- | --- | --- |
| CSS selector | CSS selector used to select an element. | N/A |
| Quantifier expression | A quantifier expression is used to control the shape of the results (direct result or array of results) and the expected result length. | See quantifier expression. |

Quantifier expression

A quantifier expression is used to assert that the query matches a set number of nodes. A quantifier expression is a modifier of the select subroutine.

A quantifier expression is defined using the following syntax.

| Name | Syntax |
| --- | --- |
| Fixed quantifier | {n} where n is an integer >= 1 |
| Greedy quantifier | {n,m} where n >= 0 and m >= n |
| Greedy quantifier | {n,} where n >= 0 |
| Greedy quantifier | {,m} where m >= 1 |

A node selector [i] can be appended to a quantifier expression, e.g. {0,}[0]. The node selector returns a single node from the result set by its zero-based index; [0] returns the first node.

If this looks familiar, it's because I have adopted the syntax from the regular expression language. However, unlike in a regular expression, a quantifier in the context of a Surgeon selector produces an error (SelectSubroutineUnexpectedResultCountError) if the selector result length is outside the quantifier range.

Examples:

// Selects 0 or more nodes.
// Result is an array.
x('select .foo {0,}');

// Selects 1 or more nodes.
// Throws an error if 0 matches found.
// Result is an array.
x('select .foo {1,}');

// Selects between 0 and 5 nodes.
// Throws an error if more than 5 matches found.
// Result is an array.
x('select .foo {0,5}');

// Selects 0 or more nodes.
// Result is the first match in the result set (or `null`).
x('select .foo {0,}[0]');
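
The quantifier rules above can be sketched as a small parser plus a length check. This is an illustration under assumptions, not Surgeon's parser: `parseQuantifier` and `applyQuantifier` are hypothetical names, a RangeError stands in for SelectSubroutineUnexpectedResultCountError, and a quantifier without a node selector is assumed to return the whole array.

```javascript
// Parses '{n}', '{n,m}', '{n,}' and '{,m}', optionally followed by '[i]'.
const parseQuantifier = (expression) => {
  const match = /^\{(\d*)(,?)(\d*)\}(?:\[(\d+)\])?$/.exec(expression);

  if (!match || (match[1] === '' && match[3] === '')) {
    throw new Error('Invalid quantifier expression: ' + expression);
  }

  const [, min, comma, max, index] = match;
  const lower = min === '' ? 0 : Number(min);
  // Without a comma the quantifier is fixed; with a comma and no max it is open-ended.
  const upper = comma === '' ? lower : max === '' ? Infinity : Number(max);

  return {
    min: lower,
    max: upper,
    index: index === undefined ? null : Number(index),
  };
};

// Asserts the result length, then returns either the array or a single node.
const applyQuantifier = (nodes, expression) => {
  const {min, max, index} = parseQuantifier(expression);

  if (nodes.length < min || nodes.length > max) {
    throw new RangeError('Expected ' + expression + ' matches, found ' + nodes.length);
  }

  if (index === null) {
    return nodes;
  }

  return index < nodes.length ? nodes[index] : null;
};

console.log(applyQuantifier(['a', 'b'], '{0,}')); // ['a', 'b']
console.log(applyQuantifier(['a', 'b'], '{0,}[0]')); // 'a'
console.log(applyQuantifier([], '{0,}[0]')); // null
```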

test subroutine

test is used to validate the current value using a regular expression.

Parameter name Description Default
Regular expression Regular expression used to test the value. N/A

Examples:

// Validates that .foo element textContent property value matches /bar/ regular expression.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | test /bar/');

See error handling for more information and usage examples of the test subroutine.

User-defined subroutines

Custom subroutines can be defined using subroutines configuration.

A subroutine is a function. A subroutine function is invoked with the following parameters:

| Position | Parameter |
| --- | --- |
| 1 | Current value, i.e. value used to query Surgeon or value returned from the previous (or ancestor) subroutine. |
| 2 | An array of values used when referencing the subroutine in an expression. |
| 3 | A bindle (context) object that gives access to the evaluator, an instance of [Evaluator]. |

Example:

const x = surgeon({
  subroutines: {
    mySubroutine: (currentValue, [firstParameterValue, secondParameterValue]) => {
      console.log(currentValue, firstParameterValue, secondParameterValue);

      return parseInt(currentValue, 10) + 1;
    }
  }
});

x('mySubroutine foo bar | mySubroutine baz qux', 0);

The above example prints:

0 "foo" "bar"
1 "baz" "qux"

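For more examples of defining subroutines, refer to the cookbook sections "Validate the results using a user-defined test function" and "Declare subroutine aliases".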

Inline subroutines

Custom subroutines can be inlined into Pianola instructions, e.g.

x(
  [
    'foo',
    (subject) => {
      // `subject` is the return value of `foo` subroutine.

      return 'bar';
    },
    'baz',
  ],
  'qux'
);

Built-in subroutine aliases

Surgeon exports an alias preset that reduces the verbosity of queries.

| Name | Description |
| --- | --- |
| ra ... | Reads Element attribute value. Equivalent to read attribute ... |
| rdtc ... | Removes any descending elements and reads the resulting textContent property of an element. Equivalent to `remove * {0,} |
| rih ... | Reads innerHTML property of an element. Equivalent to read property ... innerHTML |
| roh ... | Reads outerHTML property of an element. Equivalent to read property ... outerHTML |
| rp ... | Reads Element property value. Equivalent to read property ... |
| rtc ... | Reads textContent property of an element. Equivalent to read property ... textContent |
| sa ... | Select any (sa). Selects multiple elements (0 or more). Returns array. Equivalent to select "..." {0,} |
| saf ... | Select any first (saf). Selects multiple elements (0 or more). Returns single result or null. Equivalent to select "..." {0,}[0] |
| sm ... | Select many (sm). Selects multiple elements (1 or more). Returns array. Equivalent to select "..." {1,} |
| smo ... | Select maybe one (smo). Selects one element. Returns single result or null. Equivalent to select "..." {0,1}[0] |
| so ... | Select one (so). Selects a single element. Returns single result. Equivalent to select "..." {1}[0]. |
| t {name} | Tests value. Equivalent to test ... |

Note regarding s ... alias. The CSS selector value is quoted. Therefore, you can write a CSS selector that includes spaces without putting the value in the quotes, e.g. s .foo .bar is equivalent to select ".foo .bar" {1}.

Other alias values are not quoted. Therefore, if value includes a space it must be quoted, e.g. t "/foo bar/".

Usage:

import surgeon, {
  subroutineAliasPreset
} from 'surgeon';

const x = surgeon({
  subroutines: {
    ...subroutineAliasPreset
  }
});

x('s .foo .bar | t "/foo bar/"');

In addition to the built-in aliases, users can declare their own subroutine aliases (see "Declare subroutine aliases").

Expression reference

Surgeon subroutines are referenced using expressions.

An expression is defined using the following pseudo-grammar:

subroutines ->
    subroutines _ "|" _ subroutine
  | subroutine

subroutine ->
    subroutineName " " parameters
  | subroutineName

subroutineName ->
  [a-zA-Z0-9\-_]:+

parameters ->
    parameters " " parameter
  | parameter

Example:

x('foo bar baz', 'qux');

In this example, the Surgeon query executor (x) is invoked with the foo bar baz expression and the qux starting value. The executor runs the foo subroutine with parameter values "bar" and "baz" and subject value "qux".
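
As a rough illustration of how a single subroutine expression breaks down into a name and parameter values, consider the sketch below. `parseSubroutine` is a hypothetical helper, not Surgeon's parser; the quote handling is an assumption based on the note that parameter values containing spaces must be quoted.

```javascript
// Splits one subroutine expression into its name and parameter values.
// Quoted parameters may contain spaces; the quotes themselves are dropped.
const parseSubroutine = (expression) => {
  const tokens = [];
  const pattern = /"([^"]*)"|'([^']*)'|(\S+)/g;

  let match;

  while ((match = pattern.exec(expression)) !== null) {
    tokens.push(match[1] ?? match[2] ?? match[3]);
  }

  // The name must satisfy the subroutineName rule of the pseudo-grammar.
  if (tokens.length === 0 || !/^[a-zA-Z0-9\-_]+$/.test(tokens[0])) {
    throw new Error('Invalid subroutine expression: ' + expression);
  }

  const [subroutineName, ...parameters] = tokens;

  return {subroutineName, parameters};
};

console.log(parseSubroutine('foo bar baz'));
// { subroutineName: 'foo', parameters: [ 'bar', 'baz' ] }

console.log(parseSubroutine('t "/foo bar/"'));
// { subroutineName: 't', parameters: [ '/foo bar/' ] }
```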

Multiple subroutines can be combined using an array:

x([
  'foo bar baz',
  'corge grault garply'
], 'qux');

In this example, the Surgeon query executor (x) is invoked with two expressions (foo bar baz and corge grault garply). The first subroutine is executed with the subject value "qux". The second subroutine is executed with the result of the first subroutine as its subject value.

The result of the query is the result of the last subroutine.

Read user-defined subroutines documentation for broader explanation of the role of the parameter values and the subject value.

The pipe operator (|)

Multiple subroutines can be combined using the pipe operator.

The following examples are equivalent:

x([
  'foo bar baz',
  'qux quux quuz'
]);

x([
  'foo bar baz | qux quux quuz'
]);

x('foo bar baz | qux quux quuz');
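
One way to see why the array form and the pipe form are interchangeable is to normalise both into the same flat list of expressions. The `normalise` helper below is a hypothetical sketch; it splits naively on every | character, so it would mishandle a pipe inside a quoted parameter such as a regular expression.

```javascript
// Flattens a string or an array of strings into a list of single-subroutine expressions.
const normalise = (query) => {
  const expressions = Array.isArray(query) ? query : [query];

  return expressions.flatMap((expression) => {
    return expression.split('|').map((part) => part.trim());
  });
};

const fromArray = normalise(['foo bar baz', 'qux quux quuz']);
const fromNestedPipe = normalise(['foo bar baz | qux quux quuz']);
const fromPipe = normalise('foo bar baz | qux quux quuz');

console.log(fromArray); // [ 'foo bar baz', 'qux quux quuz' ]
```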

Cookbook

Unless redefined, all examples assume the following initialisation:

import surgeon from 'surgeon';

/**
 * @param configuration {@see https://github.com/gajus/surgeon#configuration}
 */
const x = surgeon();

Extract a single node

Use select subroutine and read subroutine to extract a single value.

const subject = `
  <div class="title">foo</div>
`;

x('select .title | read property textContent', subject);

// 'foo'

Extract multiple nodes

Specify select subroutine quantifier to match multiple results.

const subject = `
  <div class="foo">bar</div>
  <div class="foo">baz</div>
  <div class="foo">qux</div>
`;

x('select .foo {0,} | read property textContent', subject);

// [
//   'bar',
//   'baz',
//   'qux'
// ]

Name results

Use a QueryChildrenType object to name the results of the descending expressions.

const subject = `
  <article>
    <div class='title'>foo title</div>
    <div class='body'>foo body</div>
  </article>
  <article>
    <div class='title'>bar title</div>
    <div class='body'>bar body</div>
  </article>
`;

x([
  'select article {0,}',
  {
    body: 'select .body | read property textContent',
    title: 'select .title | read property textContent'
  }
], subject);

// [
//   {
//     body: 'foo body',
//     title: 'foo title'
//   },
//   {
//     body: 'bar body',
//     title: 'bar title'
//   }
// ]

Validate the results using RegExp

Use test subroutine to validate the results.

const subject = `
  <div class="foo">bar</div>
  <div class="foo">baz</div>
  <div class="foo">qux</div>
`;

x('select .foo {0,} | read property textContent | test /^[a-z]{3}$/', subject);

See error handling for information on how to handle test subroutine errors.

Validate the results using a user-defined test function

Define a custom subroutine to validate results using arbitrary logic.

Use InvalidValueSentinel to leverage standardised Surgeon error handler (see error handling). Otherwise, simply throw an error.

import surgeon, {
  InvalidValueSentinel
} from 'surgeon';

const x = surgeon({
  subroutines: {
    isRed: (value) => {
      if (value === 'red') {
        return value;
      }

      return new InvalidValueSentinel('Unexpected color.');
    }
  }
});

Declare subroutine aliases

As you become familiar with the query execution mechanism, typing long expressions (such as select, read attribute and read property) becomes a mundane task.

Remember that subroutines are regular functions: you can partially apply and use the partially applied functions to create new subroutines.

Example:

import surgeon, {
  readSubroutine,
  selectSubroutine,
  testSubroutine
} from 'surgeon';

const x = surgeon({
  subroutines: {
    ra: (subject, values, bindle) => {
      return readSubroutine(subject, ['attribute'].concat(values), bindle);
    },
    rp: (subject, values, bindle) => {
      return readSubroutine(subject, ['property'].concat(values), bindle);
    },
    s: (subject, values, bindle) => {
      return selectSubroutine(subject, [values.join(' '), '{1}'], bindle);
    },
    sm: (subject, values, bindle) => {
      return selectSubroutine(subject, [values.join(' '), '{0,}'], bindle);
    },
    t: testSubroutine
  }
});

Now, instead of writing:

articles:
- select article {0,}
- body:
  - select .body
  - read property innerHTML

You can write:

articles:
- sm article
- body:
  - s .body
  - rp innerHTML

The aliases used in this example are available in the aliases preset (read built-in subroutine aliases).
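
The partial application idea can be shown without Surgeon at all. In the sketch below, `read` is a hypothetical stand-in that works on plain objects rather than DOM nodes, and `alias` is an illustrative helper; neither is part of the Surgeon API.

```javascript
// Hypothetical stand-in for a read subroutine; it works on plain objects, not DOM nodes.
const read = (subject, [targetType, targetName]) => {
  const source = targetType === 'attribute' ? subject.attributes : subject.properties;

  if (!(targetName in source)) {
    throw new Error('Cannot read ' + targetType + ' ' + targetName);
  }

  return source[targetName];
};

// Fixes the leading parameter values of a subroutine, returning a shorter alias.
const alias = (subroutine, ...fixedValues) => {
  return (subject, values, bindle) => {
    return subroutine(subject, [...fixedValues, ...values], bindle);
  };
};

const ra = alias(read, 'attribute');
const rp = alias(read, 'property');

const element = {
  attributes: {href: '/bar'},
  properties: {textContent: 'Bar'},
};

console.log(ra(element, ['href'])); // '/bar'
console.log(rp(element, ['textContent'])); // 'Bar'
```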

Error handling

Surgeon throws the following errors to indicate a predictable error state. All Surgeon errors can be imported. Use the instanceof operator to determine the error type.

Note:

Surgeon errors are non-recoverable, i.e. a selector cannot proceed if it encounters an error. This design ensures that your selectors are capturing the expected data.

| Name | Description |
| --- | --- |
| ReadSubroutineNotFoundError | Thrown when an attempt is made to retrieve a non-existent attribute or property. |
| SelectSubroutineUnexpectedResultCountError | Thrown when a select subroutine result length does not match the quantifier expression. |
| InvalidDataError | Thrown when a subroutine returns an instance of InvalidValueSentinel. |
| SurgeonError | A generic error. All other Surgeon errors extend from SurgeonError. |

Example:

import {
  InvalidDataError
} from 'surgeon';

const subject = `
  <div class="foo">bar</div>
`;

try {
  x('select .foo | read property textContent | test /bar/', subject);
} catch (error) {
  if (error instanceof InvalidDataError) {
    // Handle data validation error.
  } else {
    throw error;
  }
}

Return an InvalidValueSentinel from a subroutine to make Surgeon throw an InvalidDataError.
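
The sentinel pattern can be sketched in isolation. The classes below are deliberately named `Sketch...` to make clear that they are hypothetical stand-ins for SurgeonError, InvalidDataError and InvalidValueSentinel, and `runChecked` is an assumed helper, not Surgeon's executor.

```javascript
// Stand-ins for the error hierarchy: a generic error and a more specific subclass.
class SketchError extends Error {}
class SketchInvalidDataError extends SketchError {}

// A sentinel is a plain return value that marks the result as invalid.
class SketchInvalidValueSentinel {
  constructor (message) {
    this.message = message;
  }
}

// Runs a subroutine and converts a returned sentinel into a thrown error.
const runChecked = (subroutine, subject) => {
  const result = subroutine(subject);

  if (result instanceof SketchInvalidValueSentinel) {
    throw new SketchInvalidDataError(result.message);
  }

  return result;
};

const isRed = (value) => {
  return value === 'red' ? value : new SketchInvalidValueSentinel('Unexpected color.');
};

let outcome;

try {
  outcome = runChecked(isRed, 'blue');
} catch (error) {
  if (error instanceof SketchInvalidDataError) {
    // Handle the data validation error.
    outcome = 'invalid: ' + error.message;
  } else {
    throw error;
  }
}

console.log(outcome); // 'invalid: Unexpected color.'
```

Returning a sentinel instead of throwing keeps the subroutine free of error classes; the executor decides which error type to raise.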

Debugging

Surgeon uses roarr to log debugging information.

Set the ROARR_LOG=TRUE environment variable to enable the Surgeon debug log.

surgeon's People

Contributors

bcliden, comlock, gajus

surgeon's Issues

Make the API declarative

The entire input, validation and formatting rules can be declared using a simple JSON object.

The benefit of this approach is portability, i.e. it is easy to move a scraper from one programming language to another.

Furthermore, it is easier to enforce a consistent style and to manage the complexity of the code base.

We could even use https://www.npmjs.com/package/jsonscript.

add "has" function (parent selector)

  • CSS does not have a parent selector.
  • cheerio has has() selector.

Since we want Surgeon to work across different environments (e.g. the browser), we'd need to implement this on the Surgeon level.

Use case

Sometimes it is required to know whether an element has a descending element to know if it is the right element, e.g. http://www.mk2.com/salles/mk2-gambetta.

To select all table rows that contain movie information, we can check if a table row includes information about the movie name.

#seances .l-mk2-tables .l-session-table .fiche-film-info::has(.fiche-film-title)

add "filter" function

Filter function is used to filter out matches before evaluating the quantifier-expression.

The filter supports a single expression "has".

Example:

test('finds a node which satisfies a parent node selector', (t) => {
  const x = surgeon();

  const document = `
  <div>
    <article>
      <h1>foo</h1>
    </article>
    <article>
      <h1>bar</h1>
      <p></p>
    </article>
  </div>
  `;

  const schema = {
    filter: {
      has: 'p'
    },
    properties: {
      heading: 'h1'
    },
    selector: 'article'
  };

  t.true(x(document, schema).heading === 'bar');
});

rtc read textContent does not insert whitespace between elements

The result is that multiple words get bunched together into long "invalid" words.
This becomes a problem when you index the scraped text and want to use ngram search on it.
https://xp.readthedocs.io/en/stable/developer/search/query-functions/ngram.html

I don't know how similar the cheerio evaluator's textContent works in comparison to the browser variant,
but it might be behaving correctly.
https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent

One might have slightly better results using innerText but that is not supported by surgeon (yet).

Notice that textContent ignores <br/> while innerText does not:
http://perfectionkills.com/the-poor-misunderstood-innerText/

A workaround I could try is to modify all block elements by adding a single space to the end of them and then use textContent, but I'm uncertain whether it's actually smart or even possible to rely on an element's style.display property.

Add ability to format the result

There has been a request to add a "formatting" ability like in scrape-it library.

It's documented as:

convert (Function): An optional function to change the value.

Example:

{
   articles: {
       listItem: ".article"
     , data: {
           createdAt: {
               selector: ".date"
+             , convert: x => new Date(x)
           }
         , title: "a.article-title"
         , tags: {
               listItem: ".tags > span"
           }
         , content: {
               selector: ".article-content"
             , how: "html"
           }
       }
   }
}

Considerations:

  • Need to consider how this integrates with validation (does formatting happen before or after?)
  • What's the API?

implied "has"

Now I am writing a lot of .event:has(.evcal_list_a):has(.evcal_event_title) to select only .event elements that contain the target elements. I wish I didn't need to do that.

Run the tests in a browser

Need to use a test runner that could automate test running in the browser. This is to test the "browser" evaluator.

css inside selector: Invalid quantifier expression

This works
select p:nth-of-type(2)|select a|read attribute href

This gives error: Invalid quantifier expression
select p:nth-of-type(2) a|read attribute href

So I believe surgeon interprets the "a" as a quantifier expression.

This gives same error
select p:nth-of-type(2) a {0,}|read attribute href

ref: https://www.w3schools.com/cssref/css_selectors.asp

| Selector | Example | Example description |
| --- | --- | --- |
| element element | div p | Selects all <p> elements inside <div> elements |

an example would be great!

I'm trying to read the README and figure out based on my subroutines, evaluator, surgeon import, and html data how to mesh it all together.

it's tough to know how to go from the yaml config to something that actually works. for example, here is what I have so far:

import { ScraperOperation } from './_base';
import * as request from 'request';
import * as surgeon from 'surgeon';

const URL = 'http://mysite.com';

export class Scraper extends ScraperOperation {
  public scrape() {
    this.readYaml(`test.yaml`) //read as string
      .then(mainSiteInstructions => {
        console.log(mainSiteInstructions);
        request(URL, (err, res, body) => {
          if(err || res.statusCode !== 200) return;

          const x = surgeon.default({ evaluator: surgeon.cheerioEvaluator() });
          x(mainSiteInstructions, body);

        });
      });
  }
}

where the yaml file is:

alphabetLinks:
  - select a {0, } | rp title | test /Rule Index/

I just seem to continuously get errors though.

add "selector" function

Sometimes different parts of the scraper script need to access the same element.

Consider this example:

  1. scrapeMovies gets a list of movie names, https://gist.github.com/gajus/68f9da3b27a51a58db990ae67e9acdae#file-mk2-js-L49-L62
  2. scrapeShowtimes parses additional information about the parsed movies, https://gist.github.com/gajus/68f9da3b27a51a58db990ae67e9acdae#file-mk2-js-L83-L106

The information is scraped from the same URL (therefore, the same document).

scrapeMovies selects movie elements, then passes an instance of the resulting cheerio selector to scrapeShowtimes, then scrapeShowtimes is using parent selector tr to find the corresponding movie table row.

Using the parent selector is bad because scrapeShowtimes should work only on the information it is provided (e.g., the identifier of an element); it shouldn't be able to traverse the DOM upwards. Furthermore, this makes logging useless.

A better alternative would be to derive a unique selector that can be shared between the processes. The above example could be then rewritten to:

export const scrapeMovies = async (guide) => {
  const document = await request('get', guide.url, 'html');

  const x = surgeon(document);

  const movies = x({
    properties: {
      "name": ".fiche-film-title",
      "movieElementSelector": "tr::selector()"
    },
    selector: '#seances .l-mk2-tables .l-session-table .fiche-film-info::has(.fiche-film-title) {0,}'
  });

  return movies.map((movie) => {
    return {
      guide: {
        url: movie.url,
        movieElementSelector: movie.movieElementSelector
      },
      result: {
        name: movie.name
      }
    }
  });
};

export const scrapeShowtimes = async (guide) => {
  const document = await request('get', guide.url, 'html');

  const x = surgeon(document);

  const events = x({
    properties: {
      time: '::text()',
      version: '(VOST|VO|VF)',
      url: '::attribute(href)'
    },
    selector: [
      guide.movieElementSelector,
      '.item-list a[href^="/reservation"]'
    ]
  });

  return events.map((event) => {
    return {
      result: {
        time: event.time,
        url: 'http://www.mk2.com' + event.url
      }
    };
  });
};

The idea is that tr::selector() returns a CSS selector that given the same document will select the same element.

This example ignores "date" selection. The latter poses another complication.

The automated release is failing 🚨

🚨 The automated release from the master branch failed. 🚨

I recommend you give this issue a high priority, so other packages depending on you could benefit from your bug fixes and new features.

You can find below the list of errors reported by semantic-release. Each one of them has to be resolved in order to automatically publish your package. I’m sure you can resolve this 💪.

Errors are usually caused by a misconfiguration or an authentication problem. With each error reported below you will find explanation and guidance to help you to resolve it.

Once all the errors are resolved, semantic-release will release your package the next time you push a commit to the master branch. You can also manually restart the failed CI job that runs semantic-release.

If you are not sure how to resolve this, here is some links that can help you:

If those don’t help, or if this issue is reporting something you think isn’t right, you can always ask the humans behind semantic-release.


The push permission to the Git repository is required.

semantic-release cannot push the version tag to the branch master on remote Git repository with URL https://github.com/gajus/surgeon.

Please refer to the authentication configuration documentation to configure the Git credentials on your CI environment and make sure the repositoryUrl is configured with a valid Git URL.


Good luck with your project ✨

Your semantic-release bot 📦🚀

Feature request: Custom options in context

Subroutines get access to the context with an evaluator property, and this signature of a context object allows for extending it with other things in the future.

My suggestion would be to add an options key to this, containing custom options that were specified during the initial surgeon(options) call, such that a developer can specify custom options for their custom subroutines.

A (synthetic) example:

let extract = surgeon({
	options: {
		magicNumber: 42
	},
	subroutines: {
		addMagicNumber: function (input, _values, context) {
			return `${input} ${context.options.magicNumber}`;
		}
	}
});

// ...

extract("select .foo {1}[0] | read property textContent | addMagicNumber", someContent);

EDIT: To clarify, this is particularly useful for accommodating eg. third-party custom subroutines that can take initialization options, as well as more complex constructions where a specific surgeon invocation might specify some options for that particular extraction (like, for example, switch/match arms that are difficult to express in the DSL).

allow selector array for chaining selectors

Consider the following example:

selector: [
  '.movie[data-id="1"]',
  '.item-list a[href^="/reservation"]'
]

We want to ensure .movie[data-id="1"] selector matches 1 element, then we want to select the descending .item-list a[href^="/reservation"] element.

This is different from .movie[data-id="1"] .item-list a[href^="/reservation"]. The latter would match:

<div class="movie" data-id="1"></div>
<div class="movie" data-id="1">
  <div class="item-list">
    <a href="/reservation"></a>
  </div>
</div>

The former would not.

An alternative is to allow use of the quantifier expression anywhere in the selector, e.g.

.movie[data-id="1"] {1}[0] .item-list a[href^="/reservation"]

add "match" function

Used to extract 1 match from the result.

This function could be invoked either from the selector, e.g.

::attribute(href)::match("/salles/(.+)")

or from the declarative API manifest, e.g.

nid: {
  selector: '::attribute(href)',
  match: '\/salles\/(.+)'
}

An implementation:

const match = (inputRule: RegExp, subject: string): string => {
  if (inputRule.ignoreCase) {
    throw new Error('"rule" parameter value must be an instance of a case sensitive RegExp.');
  }

  const rule = new RegExp(inputRule.source, 'gm');
  const matches = [];

  let match;

  // eslint-disable-next-line
  while (match = rule.exec(subject)) {
    matches.push(match[1]);
  }

  if (!matches.length) {
    throw new Error('Did not match.');
  }

  if (matches.length !== 1) {
    throw new Error('Matched more than one group.');
  }

  return matches[0];
};

Document how to combine HTTP fetching with Surgeon

As suggested by one of the programmers:

I would include a section to your README explaining how you'd combine the library with actually making HTTP requests. You could suggest a recommended approach. Otherwise it's another decision that an end user has to make, potentially leaving them to use x-ray instead.

Good call. Will include a section about how we do it at Applaudience.
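
One possible arrangement, not necessarily the one used at Applaudience, is to keep fetching and extraction separate and join them with a small function. In the sketch below `createScraper`, `fetchHtml` and `extract` are hypothetical names; the only thing assumed about Surgeon is that a configured instance is called as x(query, html).

```javascript
// Hypothetical glue code, not part of Surgeon. Both collaborators are injected:
//   fetchHtml(url)       -> Promise resolving to the HTML string
//   extract(query, html) -> extraction result, e.g. a configured Surgeon instance
const createScraper = (fetchHtml, extract) => {
  return async (url, query) => {
    const html = await fetchHtml(url);

    return extract(query, html);
  };
};

// Possible wiring, assuming a runtime with a global fetch (e.g. Node.js 18+):
// const fetchHtml = async (url) => {
//   const response = await fetch(url);
//   if (!response.ok) throw new Error('Unexpected HTTP status ' + response.status);
//   return response.text();
// };
// const scrape = createScraper(fetchHtml, x); // x is the function returned by surgeon()
// scrape('https://example.com/', 'select .title | read property textContent');
```

Injecting the fetcher keeps retries, proxies and caching out of the extraction code, and lets the same query be tested against saved HTML.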

Improve the DSL

  • Attribute selector (@name) could be ::attribute(name).
  • Property selector (@.name) could be ::property(name).

This way it would be clear from glancing at the query what the intention is.
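
To make the proposed mapping concrete, the sketch below rewrites the old shorthand into the pseudo-element form. It is an illustration only: `toPseudoSyntax` is a hypothetical name, and it assumes the shorthand always appears at the very end of the query.

```javascript
// Hypothetical converter, not part of Surgeon.
// Assumes "@name" or "@.name" is the last token of the query.
const toPseudoSyntax = (query) => {
  return query.replace(/@(\.?)([\w-]+)$/, (whole, dot, name) => {
    // A leading dot marks a property, otherwise it is an attribute.
    return (dot ? '::property(' : '::attribute(') + name + ')';
  });
};

toPseudoSyntax('a@href'); // 'a::attribute(href)'
toPseudoSyntax('h1@.textContent'); // 'h1::property(textContent)'
```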

Image mount on parse

All images in the HTML start loading as soon as it is parsed!
I need to parse the HTML silently, without fetching the images.

Need a logo

The original meat vector was made under the assumption that the library was going to be called "chop". Now that it is called "surgeon", we need to think of something else.

Please add *basic* `@types` modules for this and Pianola!

These packages cannot even be used from TypeScript without types. I couldn't care less about how accurate the types are, I just need something that allows me to import and use it from a TypeScript file. I know you've said it doesn't make sense to add types for something as dynamically/loosely typed as these libraries, but that's okay: that's what any/unknown are for. TypeScript's type system is extremely flexible and makes no compromises. You can express just about anything, and when you can't, just use any.

I've tried adding types myself, but I'll just say it's not going well. Happy to send you the files I have so far if you want to expand on them, but I think the underlying library may need to export the default export as a named function instead of an anonymous closure, which is where I got stuck and why I'm stopping now. Default exports are generally bad anyway. (The default export doesn't have to go away, you can export it in two places)

Please let me know how I can help! If I can change your mind here, I will gladly sponsor you as well, happy that I can soon use this awesome library from TypeScript.

Request to add a tutorial / how to use

Greetings, I was looking for another way to scrape websites because Puppeteer was too slow, and I came across this article. The problem is that I can't find any tutorial or guide that would help me get things up and running quickly, since I don't have much time, and the README isn't really that clear.

Thank you

License type?

Hi! Thanks, very useful!
Is this software under the MIT license?

Friendly aliases

rtc is a short alias, but it isn't friendly. The same goes for all of them.

How about additional aliases such as these:

| Current | Additional | Description |
| --- | --- | --- |
| ra | attr | Reads Element attribute value |
| rdtc | text | Removes any descending elements and reads the resulting textContent property of an element |
| rih | html | Reads innerHTML property of an element |
| roh | outerHtml | Reads outerHTML property of an element |
| rp | prop | Reads Element property value |
| rtc | allText | Reads textContent property of an element |
| sa | all | Select any (sa). Selects multiple elements (0 or more). Returns array |
| sm | many | Select many (sm). Selects multiple elements (1 or more). Returns array |
| smo | firstOrNull | Select maybe one (smo). Selects one element. Returns single result or null |
| saf | - | ... |
| so | first | Select one (so). Selects a single element. Returns single result |

Still short, but easier to remember

wdyt?
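
As a rough illustration of how such aliases could sit on top of the existing short names, the sketch below rewrites an expression before it is handed to Surgeon. `friendlyAliases` and `expandAliases` are hypothetical names, and the rewrite is deliberately naive.

```javascript
// Proposed aliases from the table above, mapped to the current short names.
const friendlyAliases = {
  all: 'sa',
  allText: 'rtc',
  attr: 'ra',
  first: 'so',
  firstOrNull: 'smo',
  html: 'rih',
  many: 'sm',
  outerHtml: 'roh',
  prop: 'rp',
  text: 'rdtc'
};

// Naive rewrite: splits on "|" and swaps the first word of each command.
// It does not handle a "|" inside a quoted selector.
const expandAliases = (expression) => {
  return expression
    .split('|')
    .map((command) => {
      const [name, ...rest] = command.trim().split(' ');
      const known = Object.prototype.hasOwnProperty.call(friendlyAliases, name);

      return [known ? friendlyAliases[name] : name, ...rest].join(' ');
    })
    .join(' | ');
};

expandAliases('first .title | allText'); // 'so .title | rtc'
```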

add "text()" function

Returns text of the node.

Throws an error if the node contains anything that is not text.

Example:

h1::text()

Would throw an error in case of:

[Screenshot: screen shot 2017-01-22 at 18 47 01]

This is going to be combined with "extractText" function (used to extract text from the node when it contains other elements).
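
The intended behaviour of text() can be sketched as follows. This is an assumption about how it might work, not the actual implementation: `readText` is a hypothetical name, and the node is anything exposing DOM-style childNodes with nodeType and nodeValue.

```javascript
// Numeric value of Node.TEXT_NODE in the DOM.
const TEXT_NODE = 3;

// Hypothetical sketch of text(): returns the text of a node whose children
// are all text nodes and throws as soon as any other node type is present.
const readText = (node) => {
  const children = Array.from(node.childNodes);

  if (children.some((child) => child.nodeType !== TEXT_NODE)) {
    throw new Error('Node contains non-text children.');
  }

  return children.map((child) => child.nodeValue).join('');
};
```

Under this sketch an h1 containing only text returns that text, while an h1 with a nested element throws.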

TypeScript support

Any plans to add types? These days it is pretty much expected out of the box.

Cannot get trivial `read property innerHTML` example working

Working through the README, I'm having trouble getting innerHTML to return anything beyond undefined.

x('select div {0,}[0] | read property textContent', '<div>hello</div>')
//=> 'hello'
x('select div {0,}[0] | read property innerHTML', '<div>hello</div>')
//=> undefined

javascript evaluation

The page I want to scrape uses an API to get its content. Is JavaScript evaluation possible in the page? How do I connect this to PhantomJS or Puppeteer?

bindle (aka context object) hardcoded? how to use it

So it looks to me that bindle is hardcoded here:
https://github.com/gajus/surgeon/blob/master/src/index.js#L79

I tried some hackery

selectBindle: (s, v) => selectSubroutine(s, v, {
	evaluator: cheerioEvaluator(),
	property: 'value'
}),
debugBindle: (s, v, bindle) => {
	log.info(toStr({bindle}));
	return bindle;
},

But when I use them like this:

test: selectBindle body|debugBindle

There is no property named property, so I'm unable to modify the bindle.

Making it possible to set the bindle at init time via userConfiguration shouldn't be much work.
Making it possible to change the bindle could also be useful. Perhaps expose a subroutine for that?
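
A workaround that needs no change to Surgeon might be to wrap a subroutine so that extra properties are merged into the bindle it receives. The sketch assumes subroutines are called as (subject, values, bindle), as in the snippet above; `withBindle` is a hypothetical name.

```javascript
// Hypothetical wrapper, not part of Surgeon: calls `subroutine` with the
// original bindle extended by `extra` (properties in `extra` take precedence).
const withBindle = (subroutine, extra) => {
  return (subject, values, bindle) => {
    return subroutine(subject, values, {
      ...bindle,
      ...extra
    });
  };
};

// Possible use in the configuration:
// debugBindle: withBindle((subject, values, bindle) => bindle, {property: 'value'})
```

This only affects the wrapped subroutine; later subroutines in the same expression still receive the original bindle.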
