Cascadia

A CSS Selector library in Julia.

Inspired by, and mostly a direct translation of, the Cascadia CSS Selector library, written in Go, by @andybalholm.

This package depends on the Gumbo.jl package by @porterjamesj, which is a Julia wrapper around Google's Gumbo HTML parser library.

Usage

Usage is simple. Use Gumbo to parse an HTML string into a document, create a Selector from a string, and then call matchall to get the nodes in the document that match the selector. Alternatively, the sel"<selector string>" string macro builds the same selector as Selector.

The matchall function returns an array of the elements that match the selector. If nothing matches, the array is empty; if exactly one element matches, the array contains one element. To test whether a selector matches, check the length of the array.

using Cascadia
using Gumbo

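# Parse an HTML string into a Gumbo document
n = parsehtml("<p id=\"foo\"><p id=\"bar\">")
# Build the same selector twice: with the constructor and with the string macro
s = Selector("#foo")
sm = sel"#foo"
matchall(s, n.root)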
# 1-element Array{Gumbo.HTMLNode,1}:
#  Gumbo.HTMLElement{:p}

matchall(sm, n.root)
# 1-element Array{Gumbo.HTMLNode,1}:
#  Gumbo.HTMLElement{:p}
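
The length check described above can be wrapped in a small predicate. This is only a sketch: `has_match` is a hypothetical helper, not part of Cascadia, and it assumes the `matchall` API used in the example above.

```julia
using Cascadia
using Gumbo

# Hypothetical helper (not part of Cascadia): true when `selector` matches
# at least one node under `root`. Relies on `matchall` returning an empty
# array when nothing matches.
has_match(selector, root) = !isempty(matchall(Selector(selector), root))

doc = parsehtml("<p id=\"foo\"><p id=\"bar\">")

has_match("#foo", doc.root)   # an element with id "foo" exists in the document
has_match("#baz", doc.root)   # no element has id "baz"
```

Using `isempty` rather than comparing the length to zero keeps the intent readable; either form works on the returned array.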

### Webscraping Example

The primary use case for this library is webscraping: the automatic extraction of information from HTML pages. As an example, the following code returns a list of questions that have been tagged with julia-lang on StackOverflow.

using Cascadia
using Gumbo
using Requests

# Fetch the page and parse the response body into a Gumbo document
r = get("http://stackoverflow.com/questions/tagged/julia-lang")
h = parsehtml(bytestring(r.data))

# Each question on the page is wrapped in an element with class "question-summary"
qs = matchall(Selector(".question-summary"), h.root)

println("StackOverflow Julia Questions (votes  answered?  url)")

for q in qs
    # Run further selectors against each question node, not the whole document
    votes = nodeText(matchall(Selector(".votes .vote-count-post"), q)[1])
    answered = length(matchall(Selector(".status.answered"), q)) > 0
    href = matchall(Selector(".question-hyperlink"), q)[1].attributes["href"]
    println("$votes  $answered  http://stackoverflow.com$href")
end

This code produces the following output:

StackOverflow Julia Questions (votes  answered?  url)

2  true  http://stackoverflow.com/questions/38688095/warning-base-writemime-is-deprecated-julia-0-5-with-jupyter
2  false  http://stackoverflow.com/questions/38687435/better-way-to-take-lots-of-dot-products
2  false  http://stackoverflow.com/questions/38686788/how-to-save-plots-with-the-correct-theme-local-font-using-gadfly-in-julia-lang
1  true  http://stackoverflow.com/questions/38680732/how-to-interpolate-into-a-julia-for-expression
1  true  http://stackoverflow.com/questions/38676573/whats-the-best-way-to-convert-an-int-to-a-string-in-julia
4  true  http://stackoverflow.com/questions/38671821/julia-non-destructively-update-immutable-type-variable
1  false  http://stackoverflow.com/questions/38663113/how-to-use-interpolations-on-sharedarray-in-worker-process-without-each-process
1  true  http://stackoverflow.com/questions/38647107/how-write-datatype-in-file-with-julia
3  false  http://stackoverflow.com/questions/38646014/julia-macro-expansion-order
1  true  http://stackoverflow.com/questions/38644939/how-can-i-get-the-system-process-id-of-a-running-external-command-in-julia
0  false  http://stackoverflow.com/questions/38638496/julia-serialize-error-when-sending-large-objects-to-workers
2  false  http://stackoverflow.com/questions/38628089/integrating-juno-ide-with-atom-editor-for-julia-in-windows
1  false  http://stackoverflow.com/questions/38626999/julia-surface-plot-custom-colors
2  false  http://stackoverflow.com/questions/38625663/subset-of-dictionary-with-aliases
2  false  http://stackoverflow.com/questions/38615552/remove-automatically-generated-color-key-in-gadfly-plot

Note that this returns the elements on the first page of the query results. Getting the values from subsequent pages is left as an exercise for the reader.

### Current Status

Most selector types are supported, but a few are still not fully functional. Examples of selectors that currently work, and some that don't yet, are listed below.

| Selector | Status |
|----------|--------|
| `address` | Works |
| `*` | Works |
| `#foo` | Works |
| `li#t1` | Works |
| `*#t4` | Works |
| `.t1` | Works |
| `p.t1` | Works |
| `div.teST` | Works |
| `.t1.fail` | Works |
| `p.t1.t2` | Works |
| `p[title]` | Works |
| `address[title="foo"]` | Works |
| `[ title ~= foo ]` | Works |
| `[title~="hello world"]` | Works |
| `[lang\|="en"]` | |
| `[title^="foo"]` | Works |
| `[title$="bar"]` | Works |
| `[title*="bar"]` | Works |
| `.t1:not(.t2)` | Works |
| `div:not(.t1)` | Works |
| `li:nth-child(odd)` | Doesn't Work |
| `li:nth-child(even)` | Doesn't Work |
| `li:nth-child(-n+2)` | Doesn't Work |
| `li:nth-child(3n+1)` | Doesn't Work |
| `li:nth-last-child(odd)` | Doesn't Work |
| `li:nth-last-child(even)` | Doesn't Work |
| `li:nth-last-child(-n+2)` | Doesn't Work |
| `li:nth-last-child(3n+1)` | Doesn't Work |
| `span:first-child` | Doesn't Work |
| `span:last-child` | Doesn't Work |
| `p:nth-of-type(2)` | Doesn't Work |
| `p:nth-last-of-type(2)` | Doesn't Work |
| `p:last-of-type` | Doesn't Work |
| `p:first-of-type` | Doesn't Work |
| `p:only-child` | Doesn't Work |
| `p:only-of-type` | Doesn't Work |
| `:empty` | Works |
| `div p` | Works |
| `div table p` | Works |
| `div > p` | Works |
| `p ~ p` | Works |
| `p + p` | Works |
| `li, p` | Works |
| `p +/*This is a comment*/ p` | Works |
| `p:contains("that wraps")` | Works |
| `p:containsOwn("that wraps")` | Works |
| `:containsOwn("inner")` | Works |
| `p:containsOwn("block")` | Works |
| `div:has(#p1)` | Works |
| `div:has(:containsOwn("2"))` | Works |
| `body :has(:containsOwn("2"))` | Doesn't Work |
| `body :haschild(:containsOwn("2"))` | Works |
| `p:matches([\d])` | Works |
| `p:matches([a-z])` | Works |
| `p:matches([a-zA-Z])` | Works |
| `p:matches([^\d])` | Works |
| `p:matches(^(0\|a))` | |
| `p:matches(^\d+$)` | Works |
| `p:not(:matches(^\d+$))` | Works |
| `div :matchesOwn(^\d+$)` | Works |
| `[href#=(fina)]:not([href#=(\/\/[^\/]+untrusted)])` | Doesn't Work |
| `[href#=(^https:\/\/[^\/]*\/?news)]` | Doesn't Work |
| `:input` | Works |
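
To illustrate two of the combinators listed as working, the sketch below contrasts the descendant selector `div p` with the child selector `div > p`. The HTML string is made up for this example, and the code assumes the same `matchall` and `sel"..."` API shown in the Usage section.

```julia
using Cascadia
using Gumbo

# Illustrative document: one <p> directly inside the <div>, one nested deeper.
doc = parsehtml("<div><p>direct</p><ul><li><p>nested</p></li></ul></div>")

descendants = matchall(sel"div p", doc.root)    # any <p> below a <div>
children    = matchall(sel"div > p", doc.root)  # only <p> that are direct children of a <div>

println(length(descendants), " descendant match(es), ", length(children), " child match(es)")
```

The descendant form also picks up the paragraph nested inside the list, while the child form is limited to the paragraph sitting directly under the `div`.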
