Jason the Miner

Harvesting data at the <html> mine... Jason the Miner is a versatile Web scraper for Node.js.

⛏ Features

  • Composable: a modular architecture based on pluggable processors, where the output of one processor feeds the input of the next. There are 3 processor categories:
    1. loaders: to fetch the data (via HTTP requests, by reading text files, etc.)
    2. parsers: to parse the data (HTML by default) & extract the relevant parts according to a predefined schema
    3. transformers: to transform and/or output the results (to a CSV file, via email, etc.)
  • Configurable: each processor can be chosen & configured independently
  • Extensible: you can register your own custom processors
  • CLI-friendly: Jason the Miner works well with pipes & redirections
  • Promise-based API
  • MIT-licensed

⛏ Installing

$ npm install -g jason-the-miner

⛏ Demos

Clone the project...

$ git clone https://github.com/mawrkus/jason-the-miner.git
$ cd jason-the-miner
$ npm install
$ npm run demos

...and have a look at the "demos" folder.

⛏ Examples

CLI

Scraping the most popular JavaScript scrapers from GitHub:

// github-config.json
{
  "load": {
    "http": {
      "url": "https://github.com/search",
      "params": {
        "l": "JavaScript",
        "o": "desc",
        "q": "scraper",
        "s": "stars",
        "type": "Repositories"
      }
    }
  },
  "parse": {
    "html": {
      "repos": [{
        "_$": ".repo-list .repo-list-item",
        "name": "h3 > a",
        "description": "div:first-child > p | trim"
      }]
    }
  },
  "transform": {
    "json-file": {
      "path": "./github-repos.json"
    }
  }
}
$ jason-the-miner -c github-config.json

Alternatively, with pipes & redirections:

// github-config.json
{
  "parse": {
    "html": {
      "repos": [{
        "_$": ".repo-list .repo-list-item",
        "name": "h3 > a",
        "description": "div:first-child > p | trim"
      }]
    }
  }
}
$ curl "https://github.com/search?q=scraper&l=JavaScript&type=Repositories" | jason-the-miner -c github-config.json > github-repos.json

API

const JasonTheMiner = require('jason-the-miner');

const jason = new JasonTheMiner();

const load = {
  http: {
    url: "https://github.com/search",
    params: {
      q: "scraper",
      l: "JavaScript",
      type: "Repositories",
      s: "stars",
      o: "desc"
    }
  }
};

const parse = {
  html: {
    repos: [{
      _$: ".repo-list .repo-list-item",
      name: "h3 > a",
      description: "div:first-child > p | trim"
    }]
  }
};

jason.harvest({ load, parse }).then(results => console.log(results));

⛏ The config file

{
  "load": {
    "[loader name]": {
      // loader options
    }
  },
  "parse": {
    "[parser name]": {
      // parser options
    }
  },
  "transform": {
    "[transformer name]": {
      // transformer options
    }
  }
}

Loaders

Jason the Miner comes with 3 built-in loaders:

  • http: uses axios as HTTP client.
    Options: all axios request options + [_concurrency=1] (to limit the number of concurrent requests when following/paginating) & [_cache] (to cache responses on the file system)
  • file: reads the content of a file.
    Options: path, [stream=false], [encoding="utf8"] & [_concurrency=1] (to limit the number of concurrent reads when paginating)
  • stdin: reads the content from the standard input.
    Options: [encoding="utf8"]

For example, an HTTP load config with pagination (pages 1 -> 3) where responses will be cached in the "tests/http-cache" folder:

...
"load": {
  "http": {
    "baseURL": "https://github.com",
    "url": "/search?l=JavaScript&o=desc&q=scraper&s=stars&type=Repositories&p={1,3}",
    "_concurrency": 2,
    "_cache": {
      "folder": "tests/http-cache"
    }
  }
}
...
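A load config using the "file" loader looks similar. A minimal sketch, with the options listed above (the file path is only an illustration):

...
"load": {
  "file": {
    "path": "./demos/data/github-search.html",
    "stream": false,
    "encoding": "utf8"
  }
}
...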

Check the "demos" folder for more examples.

Parsers

Currently, Jason the Miner comes with a single built-in parser:

  • html: parses HTML, built with Cheerio.
    Options: a parse schema (see "Schema definition" below)

Schema definition

...
  "html": {
    "title": "title | trim",
    "metas": {
      "lang": "html < attr(lang)",
      "content-type": "meta[http-equiv='Content-Type'] < attr(content)"
    },
    "stylesheets": ["link[rel='stylesheet'] < attr(href)"],
    "repos": [{
      "_$": ".repo-list .repo-list-item ? text(crawler)",
      "_slice": "0,3",
      "name": "h3 > a",
      "last-update": "relative-time < attr(datetime)",
      "_follow": {
        "_link": "h3 > a",
        "description": "meta[property='og:description'] < attr(content) | trim",
        "url": "link[rel='canonical'] < attr(href)",
        "stats": {
          "_$": ".pagehead-actions",
          "watchers": "li:nth-child(1) a.social-count | trim",
          "stars": "li:nth-child(2) a.social-count | trim",
          "forks": "li:nth-child(3) a.social-count | trim"
        },
        "_follow": {
          "_link": ".js-repo-nav span[itemprop='itemListElement']:nth-child(2) > a",
          "open-issues": [{
            "_$": ".js-navigation-container li > div > div:nth-child(3)",
            "desc": "a:first-child | trim",
            "opened": "relative-time < attr(datetime)"
          }],
          "_paginate": {
            "link": "a[rel='next']",
            "slice": "0,1",
            "depth": 2
          }
        }
      }
    }]
  }
...

A schema is a plain object that recursively defines:

  • the names of the values/collection of values that you want to extract: "title" (single value), "metas" (object), "stylesheets" (collection of values), "repos" (collection of objects)
  • how to extract them: [selector] ? [matcher] < [extractor] | [filter] (check "Parse helpers" below)

Additional instructions can be passed to the parser:

  • _$ acts as a root selector: further parsing will happen in the context of the element identified by this selector
  • _slice limits the number of elements to parse, like Array.prototype.slice(begin[, end])
  • _follow tells Jason to follow a single link (fetch new data) & to continue scraping after the new data is received
  • _paginate tells Jason to paginate (fetch & scrape new data) & to merge the new values into the current context; multiple links can be selected in order to scrape several pages in parallel (see the sketch below)
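As a minimal illustration of these instructions (the selectors below are hypothetical), a schema structured like the larger example above:

...
"html": {
  "articles": [{
    "_$": "ul.results > li",
    "_slice": "0,10",
    "title": "h2 > a | trim",
    "_follow": {
      "_link": "h2 > a",
      "author": ".author | trim",
      "comments": [".comments p | trim"],
      "_paginate": {
        "link": "a[rel='next']",
        "slice": "0,1",
        "depth": 2
      }
    }
  }]
}
...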

Parse helpers

The following syntax specifies how to extract a value:

[property name]: [selector] ? [matcher] < [extractor] | [filter]

For instance:

...
"repos": [".repo-list-item h3 > a ? text(crawler) < attr(title) | trim"]
...

This will extract a "repos" array of values from the links matched by the ".repo-list-item h3 > a" selector, keeping only those containing the text "crawler". The values are taken from the "title" attribute of each link and trimmed.

Jason has 4 built-in element matchers:

  • text(regexString)
  • html(regexString)
  • attr(attributeName,regexString)
  • slice(begin,end)

They are used to test an element in order to decide whether to include it in or exclude it from parsing. If no matcher is specified, Jason includes every element.

6 built-in text extractors:

  • text([optionalStaticText]) (by default)
  • html()
  • attr(attributeName)
  • regex(regexString)
  • date(inputFormat,outputFormat) (parses a date with moment)
  • uuid() (generates a uuid v1 with uuid)

and 4 built-in text filters:

  • trim
  • single-space
  • lowercase
  • uppercase
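For instance, a snippet combining some of these extractors and filters (the selectors and date formats below are hypothetical):

...
"html": {
  "id": ".release < uuid()",
  "date": ".release .date < date(DD/MM/YYYY,YYYY-MM-DD)",
  "name": ".release h2 | trim",
  "tag": ".release .tag | lowercase"
}
...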

Transformers

Jason the Miner comes with 5 built-in transformers:

  • stdout: writes the results to the standard output.
    Options: [encoding="utf8"]
  • json-file: writes the results to a JSON file.
    Options: path & [encoding="utf8"]
  • csv-file: writes the results to a CSV file using csv-stringify.
    Options: same as csv-stringify + path & [encoding="utf8"]
  • download-file: downloads files to a given folder using axios.
    Options: [baseURL], [parseKey], [folder='.'], [namePattern='{name}'], [maxSizeInMb=1] & [concurrency=1]
  • email: sends the results by email using nodemailer.
    Options: same as nodemailer
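For example, a transform config writing the results to a CSV file (the csv-stringify "header" option is shown only as an illustration):

...
"transform": {
  "csv-file": {
    "path": "./github-repos.csv",
    "encoding": "utf8",
    "header": true
  }
}
...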

⛏ API

constructor({ fallbacks = {} } = {})

fallbacks defines which processor to use when not explicitly configured (or missing in the config file):

  • load: 'identity',
  • parse: 'identity',
  • transform: 'identity'

The fallbacks change when using the CLI (see bin/jason-the-miner.js):

  • load: 'stdin',
  • parse: 'html',
  • transform: 'stdout'
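For instance, to reproduce the CLI defaults when using the API:

const JasonTheMiner = require('jason-the-miner');

const jason = new JasonTheMiner({
  fallbacks: {
    load: 'stdin',
    parse: 'html',
    transform: 'stdout'
  }
});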

loadConfig(configFile)

Loads a config from a JSON or JS file.

jason.loadConfig('./harvest-me.json');

harvest({ load, parse, transform } = {})

Launches the harvesting process:

jason
  .loadConfig('./config.json')
  .then(() => jason.harvest())
  .catch(error => console.error(error));

You can pass custom options to temporarily override the current config:

jason
  .loadConfig('./config.json')
  .then(() => jason.harvest({
    load: {
      http: {
        url: "https://github.com/search?q=scraper&l=Python&type=Repositories"
      }
    }
  }))
  .catch(error => console.error(error));

To permanently override the current config, you can directly modify Jason's config property:

const allResults = [];

jason
  .loadConfig('./harvest-me.json')
  .then(() => jason.harvest())
  .then((results) => {
    allResults.push(results);

    jason.config.load.http.url = 'https://github.com/search?q=scraper&l=Python&type=Repositories';

    return jason.harvest();
  })
  .then((results) => {
    allResults.push(results);
  })
  .catch(error => console.error(error));

registerHelper({ category, name, helper })

Registers a parse helper in one of the 3 categories: match, extract or filter. helper must be a function.

jason.registerHelper({
  category: 'filter',
  name: 'remove-protocol',
  helper: text => text.replace(/^https?:/, '')
});
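Once registered, the helper can be referenced in a schema like any built-in filter, e.g.:

...
"url": "link[rel='canonical'] < attr(href) | remove-protocol"
...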

registerProcessor({ category, name, processor })

Registers a new processor in one of the 3 categories: load, parse or transform. processor must be a class implementing the run() method:

jason.registerProcessor({
  category: 'transform',
  name: 'template',
  processor: Templater
});

class Templater {
  constructor(config) {
    // automatically receives its config (here, jason.config.transform.template)
  }

  /**
   * @param {*} results
   * @return {Promise.<*>}
   */
  run({ results }) {
    // must be implemented & must return a promise.
  }
}

jason.config.transform = {
  template: {
    "templatePath": "my-template.tpl",
    "outputPath": "my-page.html"
  }
};
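For reference, a minimal sketch of what such a transformer could look like, assuming a naive "{results}" placeholder in the template file (this is not part of the library):

const fs = require('fs');

class Templater {
  constructor({ templatePath, outputPath }) {
    // config comes from jason.config.transform.template
    this._templatePath = templatePath;
    this._outputPath = outputPath;
  }

  run({ results }) {
    return new Promise((resolve, reject) => {
      fs.readFile(this._templatePath, 'utf8', (readError, template) => {
        if (readError) {
          return reject(readError);
        }

        // naive substitution: replace "{results}" with the serialized results
        const page = template.replace('{results}', JSON.stringify(results, null, 2));

        fs.writeFile(this._outputPath, page, (writeError) => {
          if (writeError) {
            return reject(writeError);
          }

          // resolve with the results so they can be processed further if needed
          resolve(results);
        });
      });
    });
  }
}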

Be aware that loaders must also implement the getConfig(), buildPaginationLinks() and buildLoadOptions({ link }) methods. Have a look at the source code for more info.

⛏ Testing

$ git clone https://github.com/mawrkus/jason-the-miner.git
$ cd jason-the-miner
$ npm install
$ npm run test

⛏ References & related links

⛏ A final note...

Please take these guidelines into consideration when scraping:

  • The content being scraped is not copyright protected.
  • The act of scraping does not burden the services of the site being scraped.
  • The scraper does not violate the Terms of Use of the site being scraped.
  • The scraper does not gather sensitive user information.
  • The scraped content adheres to fair use standards.
