graphql-scraper

GraphQL lets us query all sorts of graph-shaped data - so why not use it to query the world's most useful graph, the web?

graphql-scraper is a command-line tool and reusable GraphQL schema which lets you easily extract data from HTML.

Check out a live demo here. You can easily spin up your own by using graphql-scraper-server.

The command-line tool

npx graphql-scraper <query-file>

or

npm install -g graphql-scraper
graphql-scraper <query-file>

Reads a GraphQL query from the path query-file, and prints the result.

If query-file is not given, reads the query from stdin.

Command-line options

  • --json Returns the result in JSON format, for use in other tools.
  • --help Prints a help string.
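As a rough sketch of using the --json output in another tool, the snippet below parses the printed text and pulls out the items list. The exact output shape is not documented here, so the helper (a hypothetical itemsFromJson) accepts either a GraphQL-style { data: ... } envelope or the bare result object, and the sample data is made up.

```javascript
// Hypothetical helper: extracts the `items` list from the text printed by
// `graphql-scraper query.graphql --json`.
// Assumption: the output is either { data: { page: ... } } or the bare
// { page: ... } object; this README does not say which.
function itemsFromJson(jsonText) {
  const parsed = JSON.parse(jsonText)
  const root = parsed && parsed.data ? parsed.data : parsed
  const items = root && root.page && root.page.items
  return Array.isArray(items) ? items : []
}

// Made-up sample output, shaped after the aliases used in the example query.
const sample = JSON.stringify({
  data: {
    page: {
      items: [{ rank: '1.', title: 'Example story', url: 'https://example.com/' }],
    },
  },
})

const titles = itemsFromJson(sample).map(item => item.title)
console.log(titles)
```

In a real pipeline the text would come from the CLI's stdout rather than from a sample string.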

Variables

Any other named option you pass to the CLI is used as a query variable.

For example, if you want to reuse the same query on several pages, you could write the following query file (query.graphql):

query ExampleQueryWithVariable($page: String) {
  page(url: $page) {
    items: queryAll(selector: "tr.athing") {
      rank: text(selector: "td span.rank")
      title: text(selector: "td.title a")
      sitebit: text(selector: "span.comhead a")
      url: attr(selector: "td.title a", name: "href")
      attrs: next {
        score: text(selector: "span.score")
        user: text(selector: "a:first-of-type")
        comments: text(selector: "a:nth-of-type(3)")
      }
    }
  }
}

...and execute the query like this:

graphql-scraper query.graphql --page="https://news.ycombinator.com/"
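To illustrate the idea, here is a minimal sketch of how named options could be turned into a variables object. It is not graphql-scraper's actual argument parser; the variablesFromArgs function and the RESERVED set are illustrative names, and the real tool may treat edge cases differently.

```javascript
// Sketch only: collects `--name=value` and `--name value` options into a
// variables object, skipping the flags the CLI keeps for itself.
// Assumption: this mirrors the documented behaviour, not the real parser.
const RESERVED = new Set(['json', 'help'])

function variablesFromArgs(argv) {
  const variables = {}
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i]
    if (!arg.startsWith('--')) continue // positional, e.g. the query file
    const eq = arg.indexOf('=')
    const name = eq === -1 ? arg.slice(2) : arg.slice(2, eq)
    if (RESERVED.has(name)) continue
    if (eq !== -1) {
      variables[name] = arg.slice(eq + 1)
    } else if (i + 1 < argv.length && !argv[i + 1].startsWith('--')) {
      variables[name] = argv[++i] // value given as the next argument
    } else {
      variables[name] = true // bare flag with no value
    }
  }
  return variables
}

const variables = variablesFromArgs([
  'query.graphql',
  '--page=https://news.ycombinator.com/',
  '--json',
])
console.log(variables)
```

The resulting object has the shape GraphQL expects for variable values, so $page in the query above would receive the URL.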

The schema

You can check out an auto-generated schema description here, but I recommend trying out the graphql-scraper-server example and exploring the types interactively. You can also play around with the schema in the live demo.

Re-using the schema in your own projects

The npm package exports the GraphQL schema which is used by the command-line tool. This is an instance of graphql-js GraphQLSchema, which you can use anywhere that expects a schema, for example apollo-server or graphql-yoga.

Use npm install graphql-scraper or yarn add graphql-scraper to add the schema to your project.

Basic example with graphql

import { graphql } from 'graphql'
import schema from 'graphql-scraper'
// You can also import it as follows:
// const schema = require('graphql-scraper')


const query = `
{
  page(url: "http://news.ycombinator.com") {
    items: queryAll(selector: "tr.athing") {
      rank: text(selector: "td span.rank")
      title: text(selector: "td.title a")
      sitebit: text(selector: "span.comhead a")
      url: attr(selector: "td.title a", name: "href")
      attrs: next {
        score: text(selector: "span.score")
        user: text(selector: "a:first-of-type")
        comments: text(selector: "a:nth-of-type(3)")
      }
    }
  }
}
`

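// The object-argument form works in older and newer graphql-js releases;
// positional arguments were removed in graphql-js v16.
graphql({ schema, source: query }).then(response => {
  // response has the usual shape: { data, errors }
  console.log(response)
})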
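The resolved value is a standard GraphQL execution result, so a failed query resolves with an errors array rather than rejecting the promise. A minimal sketch of a helper that surfaces those errors follows; unwrap is a hypothetical name, and the two result objects are stand-ins rather than real scraper output.

```javascript
// Hypothetical helper: returns `data` from a GraphQL execution result,
// or throws if the result carries any errors.
function unwrap(response) {
  if (response.errors && response.errors.length > 0) {
    const messages = response.errors.map(error => error.message).join('; ')
    throw new Error(`GraphQL query failed: ${messages}`)
  }
  return response.data
}

// Stand-in results, shaped like what graphql() resolves to.
const ok = { data: { page: { items: [] } } }
const failed = {
  data: null,
  errors: [{ message: 'Cannot query field "foo" on type "Query".' }],
}

console.log(unwrap(ok))
try {
  unwrap(failed)
} catch (error) {
  console.error(error.message)
}
```

It can be chained onto the call above by passing unwrap to .then() before the handler that uses the data.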

Background

This project was inspired by gdom, which is written in Python and uses the Graphene GraphQL library.

If you want to switch over from gdom, please note some schema changes:

  • query(selector: String!) now returns a single Element rather than a list (like document.querySelector). A new queryAll(selector: String!): [Element] field behaves like document.querySelectorAll.
  • is(selector: String!) is renamed to has(selector: String!).
  • children, parent, siblings, next etc. no longer have a selector argument. If you need to select children with a specific selector, use child selectors (.foo > .bar).
  • parents is removed.
  • prev[All] is renamed to previous[All].
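The pure renames in this list can be applied mechanically. The sketch below is a naive, regex-based aid, not part of graphql-scraper: migrateGdomQuery and RENAMES are hypothetical names, and it does not skip string literals, so a selector containing text like ":is(" would also be rewritten. The removed selector arguments and the removed parents field still need manual changes.

```javascript
// Naive sketch: rewrites gdom field names to their graphql-scraper
// equivalents. Assumption: field names do not also appear inside selector
// strings; review the result by hand.
const RENAMES = [
  // gdom's query returned a list, so queryAll keeps that behaviour.
  // The lookahead skips an anonymous operation such as `query($page: String)`.
  [/\bquery\((?!\s*\$)/g, 'queryAll('],
  [/\bis\(/g, 'has('],
  [/\bprevAll\b/g, 'previousAll'],
  [/\bprev\b/g, 'previous'],
]

function migrateGdomQuery(source) {
  return RENAMES.reduce(
    (text, [pattern, replacement]) => text.replace(pattern, replacement),
    source
  )
}

// Made-up gdom-style query for illustration.
const gdomQuery = `
{
  page(url: "http://example.com") {
    rows: query(selector: "tr") {
      flagged: is(selector: ".flag")
      before: prev { text }
    }
  }
}
`

console.log(migrateGdomQuery(gdomQuery))
```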

Maintainers

@lachenmayer

Contribute

PRs accepted.

License

MIT © 2018 harry lachenmayer
