ScrapeQL

SQL-like query parser to extract and combine structured data from different sources such as:

  • HTML & XML
  • CSV (not implemented yet)
  • JSON (not implemented yet)
  • SQL Databases (not implemented yet)

#Why?

In general, it is better to use a dedicated API to process external data.

But sometimes this is not possible, and you need to extract and combine data from several sources. This can be a very complicated task, because you have to deal with:

  • untidy HTML
  • many CSV files
  • complex code to navigate through data
  • many loops, ifs,...

ScrapeQL gives you a powerful SQL-like query language and jQuery-like selectors, which make fetching and combining data as simple as writing a database query.

#Features v0.2 (2015-08-07)

  • Run Queries
  • Cartesian product & where clause (INNER JOIN, CROSS JOIN, JOIN,...)
    • Known bug: the where clause parser mishandles conditions that use AND or OR
  • Multiple threads for loading data (waits for the slowest before merging data)
  • Basic functions
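
The loading and joining behaviour listed above can be illustrated with a minimal Python sketch. It is not ScrapeQL's implementation: the helpers `load_all` and `join`, the two loader functions, and the in-memory rows are all hypothetical stand-ins. The sketch loads every source in its own thread, waits for the slowest one, then builds the Cartesian product and filters it with a where predicate.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def load_all(loaders):
    """Run every loader in its own thread and return the results in order.

    Collecting each future's result blocks until that loader is done,
    so the slowest source determines when merging can start.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(loaders))) as pool:
        futures = [pool.submit(loader) for loader in loaders]
        return [future.result() for future in futures]


def join(left, right, where):
    """Cartesian product of two row lists, filtered by a predicate."""
    return [(a, b) for a, b in product(left, right) if where(a, b)]


# Hypothetical in-memory sources standing in for remote pages or files.
def load_languages():
    return [{"name": "Java", "vendor_id": 1}, {"name": "Go", "vendor_id": 2}]


def load_vendors():
    return [{"id": 1, "vendor": "Oracle"}, {"id": 2, "vendor": "Google"}]


languages, vendors = load_all([load_languages, load_vendors])
rows = join(languages, vendors, lambda lang, ven: lang["vendor_id"] == ven["id"])
result = [(lang["name"], ven["vendor"]) for lang, ven in rows]
print(result)
```

With a predicate that always returns true, `join` degenerates to a plain cross join; an equality predicate like the one above behaves like an inner join.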

#User guide

Please take a look at the ScrapeQL Wiki

#Sample

Sample query:

SELECT tag_text(wiki.key) AS key, tag_text(wiki.value) AS value 
FROM ( 
    LOAD $('th') AS key, $('td>*') AS value 
    FROM load_html(url('https://en.wikipedia.org/wiki/Java_(programming_language)'))
    $('table.infobox>tbody>tr')
) AS wiki 
WHERE str_length(regex_replace(tag_text(wiki.value),'[\W]+','')) > 0

Result:

+====================================================+
| key                    | value                     |
|====================================================|
| Paradigm               | multi-paradigm            |
| Paradigm               | object-oriented           |
| Paradigm               | class-based               |
| Paradigm               | structured                |
| Paradigm               | imperative                |
| Paradigm               | functional                |
| Paradigm               | generic                   |
| Paradigm               | reflective                |
| Paradigm               | concurrent                |
| Designed by            | James Gosling             |
| Designed by            | Sun Microsystems          |
| Developer              | Oracle Corporation        |
| First appeared         | ; 20 years ago            |
| First appeared         |  (1995)                   |
| First appeared         | [1]                       |
| Stable release         | [2]                       |
| Stable release         | ; 24 days ago             |
| Stable release         |  (2015-07-14)             |
| Stable release         | [2]                       |
| Preview release        | ; 6 months ago            |
| Preview release        |  (2015-01-20)             |
| Typing discipline      | Static, strong, safe      |
| Typing discipline      | nominative                |
| Typing discipline      | manifest                  |
| Implementation language| C                         |
| Implementation language| C++                       |
| OS                     | Cross-platform            |
| License                | GNU General Public License|
| License                | Java Community Process    |
| Filename extensions    | .class                    |
| Filename extensions    | .jar                      |
| Website                | Official Site             |
| Website                | For Java Developers       |
+----------------------------------------------------+
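
The WHERE clause of the sample query strips every run of non-word characters from the cell text and keeps the row only if something is left. The Python sketch below mirrors that check as an illustration. The helper name `has_word_chars` and the test values are hypothetical, and it assumes ScrapeQL's `\W` matches anything outside `[A-Za-z0-9_]`, which is why `re.ASCII` is set.

```python
import re

# re.ASCII restricts \W to "anything other than [A-Za-z0-9_]".
# This is an assumption about the regex dialect ScrapeQL uses.
NON_WORD = re.compile(r"[\W]+", re.ASCII)


def has_word_chars(text):
    # Equivalent in spirit to:
    #   str_length(regex_replace(text, '[\W]+', '')) > 0
    return len(NON_WORD.sub("", text)) > 0


# Hypothetical cell values; only the first three appear in the result table.
cells = ["[1]", " (1995)", "; 20 years ago", " ", "/", ""]
kept = [cell for cell in cells if has_word_chars(cell)]
print(kept)
```

Under these assumptions, cells such as `[1]` survive because the digit is a word character, while cells holding only whitespace or punctuation are dropped. That is consistent with the footnote markers that still show up in the result above.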

#Disclaimer

Scraping websites is illegal in some countries. Please handle this tool with care!

This is a proof of concept and a work in progress. It may not be suited for production environments.

#Used tools

  • jsoup for HTML parsing
  • Scala parser combinators for parsing the queries
  • Utilities like Guava, Lombok,...
