
montferret / ferret

Declarative web scraping

Home Page: https://www.montferret.dev/

License: Apache License 2.0

Languages: Makefile 0.08%, Go 79.88%, ANTLR 0.51%, HTML 17.78%, JavaScript 1.74%, CSS 0.01%
Topics: golang, query-language, data-mining, scraping, scraping-websites, dsl, cdp, crawling, scraper, crawler

ferret's People

Contributors

agneum, ap0, bundleman, clock21am, conago, davad, dependabot-preview[bot], dependabot[bot], eruca, esell, gabriel-marinkovic, jasonparekh, jbampton, junebugfix, kiinoo, krishnakarthik9, loic5, mikemaccana, panakour, pangoraw, pierrebrisorgueil, prateekchaplot, slowmanchan, sumlare, testwill, three7six, trheyi, tsukasai, u5surf, ziflex


ferret's Issues

Implement strings.Like()

The function must check whether the pattern search is contained in the string text, using wildcard matching.

LIKE("cart", "ca_t")   // true
LIKE("carrot", "ca_t") // false
LIKE("carrot", "ca%t") // true

LIKE("foo bar baz", "bar")   // false
LIKE("foo bar baz", "%bar%") // true
LIKE("bar", "%bar%")         // true

LIKE("FoO bAr BaZ", "fOo%bAz")       // false
LIKE("FoO bAr BaZ", "fOo%bAz", true) // true

Src
Docs

Fix browser launcher

If the CLI is started with the --cdp-launch flag, it should check whether Chrome is running and, if not, open a new instance with the --remote-debugging-port flag.

  • macOS
  • Linux
  • Windows

New to Go, Not working on Ubuntu 18

I'm new to this Go stuff. I tried installing this on Ubuntu 18: first installing Go, and then trying to make ferret... Would it be possible to post a complete newbie guide with all the command-line steps? It would be much appreciated, thanks!

Add array functions

  • APPEND(anyArray, values, unique) → newArray
  • FIRST(anyArray) → firstElement
  • FLATTEN(anyArray, depth) → flatArray
  • INTERSECTION(array1, array2, ... arrayN) → newArray
  • LAST(anyArray) → lastElement
  • LENGTH(anyArray) → length
  • MINUS(array1, array2, ... arrayN) → newArray
  • NTH(anyArray, position) → nthElement
  • OUTERSECTION(array1, array2, ... arrayN) → newArray
  • POP(anyArray) → newArray
  • POSITION(anyArray, search, returnIndex) → position
  • PUSH(anyArray, value, unique) → newArray
  • REMOVE_NTH(anyArray, position) → newArray
  • REMOVE_VALUE(anyArray, value, limit) → newArray
  • REMOVE_VALUES(anyArray, values) → newArray
  • REVERSE(anyArray) → reversedArray
  • SHIFT(anyArray) → newArray
  • SLICE(anyArray, start, length) → newArray
  • SORTED(anyArray) → newArray
  • SORTED_UNIQUE(anyArray) → newArray
  • UNION(array1, array2, ... arrayN) → newArray
  • UNION_DISTINCT(array1, array2, ... arrayN) → newArray
  • UNIQUE(anyArray) → newArray
  • UNSHIFT(anyArray, value, unique) → newArray
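
As an illustration, a query might combine a few of these as follows (a sketch only, assuming AQL-compatible semantics; the input array is arbitrary):

LET items = [1, 2, 2, 3]
RETURN {
    first: FIRST(items),           // first element
    appended: APPEND(items, [4]),  // items with 4 appended
    deduped: UNIQUE(items)         // duplicates removed
}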

Fix TravisCI

The problem is that I cannot install ANTLR4 properly in order to generate the parser.

Integration tests

Currently, we run unit tests covering functionality that does not require a running browser.
As a result, the most complex and valuable functionality - working with dynamic pages - is not covered by tests.
We need to build an infrastructure that allows us to test dynamic pages with predictable results.
Here is a draft of how it could be implemented:

  • a web server serving static files that runs during tests
  • a set of static pages served by that server
  • a dynamic page - a single-page application that can be interacted with (inputs, buttons, redirects on clicks)
  • a TravisCI setup that runs the suite with a headless browser

Add WAIT_ELEMENT function

Add a WAIT_ELEMENT(doc, selector, timeout = 1000) function.
The function would pause execution until it finds an element by the given selector, with a default timeout of 1000 ms.
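
For illustration, usage of the proposed function might look like this (a sketch; the selector is a placeholder):

LET doc = DOCUMENT("https://www.montferret.dev/", true)
WAIT_ELEMENT(doc, '.main-content', 1000)
RETURN ELEMENT(doc, '.main-content').innerText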

Add logging

We need a mechanism that allows us to log all issues that are not handled properly (in APIs that do not return errors).

We should pass a logger via the context.

Moreover, we need to be able to set a custom output for the logger: stdout, a file, or anything else that implements the Writer interface.

Add WAIT_EVENT and WAIT_ELEMENT_EVENT function

Add a function which waits for a certain event from a given document or an element.

WAIT_EVENT(doc, selector, eventName, timeout)

and

WAIT_ELEMENT_EVENT(docOrEl, eventName, timeout)
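
A usage sketch for both proposed functions (the URL, selector, and event name are placeholders):

LET doc = DOCUMENT("https://www.montferret.dev/", true)
WAIT_EVENT(doc, '.gallery', 'load', 5000)
WAIT_ELEMENT_EVENT(ELEMENT(doc, '.gallery'), 'load', 5000)
RETURN true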

Encoding issue for Simplified Chinese (GB2312)

I need to grab a GB2312-encoded HTML page, such as http://tour.sanya.gov.cn/News.asp. How could I convert the page body data from GB2312 to UTF-8 in the code below?

LET doc = DOCUMENT("http://tour.sanya.gov.cn/News.asp")

If I don't do this, I can't convert the extracted JSON data from GB2312 to UTF-8. Any suggestions for this issue?

Add CLICK function

Add a CLICK(el) function that emits a click event for the passed element.

CDP only.
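
A possible usage sketch (the selector is a placeholder):

LET doc = DOCUMENT("https://www.montferret.dev/", true)
LET btn = ELEMENT(doc, 'a.button')
CLICK(btn)
RETURN btn.innerText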

Add numeric functions

  • ABS(value) → unsignedValue
  • ACOS(value) → num
  • ASIN(value) → num
  • ATAN(value) → num
  • ATAN2(y, x) → num
  • AVERAGE(numArray) → mean
  • CEIL(value) → roundedValue
  • COS(value) → num
  • DEGREES(rad) → num
  • EXP(value) → num
  • EXP2(value) → num
  • FLOOR(value) → roundedValue
  • LOG(value) → num
  • LOG2(value) → num
  • LOG10(value) → num
  • MAX(anyArray) → max
  • MEDIAN(numArray) → median
  • MIN(anyArray) → min
  • PERCENTILE(numArray, n, method) → percentile
  • PI() → pi
  • POW(base, exp) → num
  • RADIANS(deg) → num
  • RAND() → randomNumber
  • RANGE(start, stop, step) → numArray
  • ROUND(value) → roundedValue
  • SIN(value) → num
  • SQRT(value) → squareRoot
  • STDDEV_POPULATION(numArray) → num
  • STDDEV_SAMPLE(numArray) → num
  • SUM(numArray) → sum
  • TAN(value) → num
  • VARIANCE_POPULATION(numArray) → num
  • VARIANCE_SAMPLE(array) → num

https://docs.arangodb.com/3.4/AQL/Functions/Numeric.html
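
A possible usage sketch, assuming AQL-compatible semantics:

LET nums = [1, 2, 3, 4]
RETURN {
    sum: SUM(nums),        // 10
    avg: AVERAGE(nums),    // 2.5
    root: SQRT(16)         // 4
}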

Add INPUT function.

Add the possibility to fill out forms using an INPUT(el, value) / INPUT(doc, selector, value) function.
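
A sketch of both proposed signatures (the URL and selector are placeholders):

LET doc = DOCUMENT("https://www.montferret.dev/", true)
INPUT(doc, 'input[name="q"]', "ferret")            // document + selector signature
INPUT(ELEMENT(doc, 'input[name="q"]'), "ferret")   // element signature
RETURN true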

Improve UA generation

Is your feature request related to a problem? Please describe.
The current algorithm of UA generation gives us UA strings of very old browsers, which leads to unpredictable results such as wrong CSS selectors or an inability to render a page at all.

Describe the solution you'd like
The solution we need is a list (ideally a smart generator) of modern browsers that covers a wide variety of:

  • versions (modern ones)
  • browser types
  • platforms

Libraries to consider:

Implement string functions

  • FIND_FIRST(text, search, start, end) → position
  • FIND_LAST(text, search, start, end) → position
  • JSON_PARSE(text) → value
  • JSON_STRINGIFY(value) → text
  • LEFT(value, length) → substring
  • LENGTH(str) → length
  • LIKE(text, search, caseInsensitive) → bool
  • LOWER(value) → lowerCaseString
  • LTRIM(value, chars) → strippedString
  • REGEX_TEST(text, search, caseInsensitive) → bool
  • REGEX_REPLACE(text, search, replacement, caseInsensitive) → string
  • REVERSE(value) → reversedString
  • RIGHT(value, length) → substring
  • RTRIM(value, chars) → strippedString
  • SPLIT(value, separator, limit) → strArray
  • SUBSTITUTE(value, search, replace, limit) → substitutedString
  • SUBSTRING(value, offset, length) → substring
  • UPPER(value) → upperCaseString
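
A quick usage sketch of a few of these, assuming AQL-compatible semantics:

RETURN {
    upper: UPPER("ferret"),          // "FERRET"
    parts: SPLIT("a,b,c", ","),      // ["a", "b", "c"]
    sub: SUBSTRING("ferret", 0, 3)   // "fer"
}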

GetOuterHTML: rpc error: Could not find node with given id

Describe the bug
Returns the given error even though the element actually exists.

To Reproduce
Appears randomly. You can run the input.fql script from the examples multiple times and notice that sometimes it returns an empty array.

Expected behavior
Should always find an element.

Additional context
That might be a problem either in the cdp package or in Chrome itself.

Add WAIT_CLASS function

Add a WAIT_CLASS function that stops execution until the given CSS class(es) appear on an element.

Signature:

WAIT_CLASS(document, selector, class...)

Where:

  • document - document object
  • selector - CSS selector to find an element as a class owner.
  • class - an arbitrary number of CSS classes as multiple arguments (at least 1) to wait for

WAIT_CLASS(element, class...)

Where:

  • element - element object
  • class - an arbitrary number of CSS classes as multiple arguments (at least 1) to wait for
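
A sketch of both proposed forms (the URL, selector, and class names are placeholders):

LET doc = DOCUMENT("https://www.montferret.dev/", true)
WAIT_CLASS(doc, '#app', 'loaded')
WAIT_CLASS(ELEMENT(doc, '#app'), 'loaded', 'visible')
RETURN true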

Multi-page Requests

I'm having an issue with gathering links and then going through those links - i.e., trying to get a list of articles, then going to each actual article and getting the title, content, etc. I can get the list of links, and I can scrape a specific page, but putting them together keeps timing out. Any ideas on what I'm doing wrong?

%
LET doc = DOCUMENT('https://www.theverge.com/tech', true)
WAIT_ELEMENT(doc, '.c-compact-river__entry', 5000)
LET articles = ELEMENTS(doc, '.c-entry-box--compact__image-wrapper')
LET links = (
    FOR article IN articles
        RETURN article.attributes.href
)
FOR link IN links
    NAVIGATE(doc, link)
    LET doc = DOCUMENT(link, true)
    WAIT_ELEMENT(doc, '.c-entry-content', 5000)
    LET texter = ELEMENT(doc, '.c-entry-content')
    RETURN texter.innerText
%

Add range operator

2010..2013

should produce the following result:

[ 2010, 2011, 2012, 2013 ]
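
For context, the operator would typically feed a FOR loop, e.g. (a sketch):

FOR year IN 2010..2013
    RETURN year    // [ 2010, 2011, 2012, 2013 ]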

Add PDF function

Add a function which converts an open page into a PDF and returns it as binary data.

The method is supposed to be handled by the dynamic web driver.
There are two signatures:

PDF(document) -> binary
and
PDF(url) -> binary

and it should use the cdp client API.
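
A usage sketch of the two proposed signatures:

LET doc = DOCUMENT("https://www.montferret.dev/", true)
RETURN PDF(doc)    // document signature
// or, with the url signature:
// RETURN PDF("https://www.montferret.dev/")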

Confused about "quick-start"

After I install ferret by

go get github.com/MontFerret/ferret

It doesn't seem to work by just typing ferret
(screenshot)
However, it works with

go run ./main.go

(screenshot)

Did I not install it correctly?

Implement COLLECT keyword

COLLECT variableName = expression options
COLLECT variableName = expression INTO groupsVariable options
COLLECT variableName = expression INTO groupsVariable = projectionExpression options
COLLECT variableName = expression INTO groupsVariable KEEP keepVariable options
COLLECT variableName = expression WITH COUNT INTO countVariable options
COLLECT variableName = expression AGGREGATE variableName = aggregateExpression options
COLLECT AGGREGATE variableName = aggregateExpression options
COLLECT WITH COUNT INTO countVariable options
  • Grouping
  • Grouping with projection
  • Grouping with counting
  • Grouping with aggregation
  • Aggregation
  • Counting

https://docs.arangodb.com/3.4/AQL/Operations/Collect.html
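
For illustration, grouping with counting might look like this (a sketch, assuming AQL-compatible semantics, with inline sample data):

LET users = [{ age: 25 }, { age: 25 }, { age: 30 }]
FOR u IN users
    COLLECT age = u.age WITH COUNT INTO total
    RETURN { age: age, total: total }
// [ { age: 25, total: 2 }, { age: 30, total: 1 } ]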

Add object functions

  • KEYS(object, sort) → strArray
  • HAS(object, keyName) → isPresent
  • LENGTH(object) → count (implemented here)
  • MERGE(object1, object2, ... objectN) → newMergedObject
  • MERGE_RECURSIVE(object1, object2, ... objectN) → newMergedObject
  • VALUES(document, removeInternal) → anyArray
  • ZIP(keys, values) → newObj
  • KEEP(object, key1, key2, ... key) → newObj
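
One way these might be used together (a sketch, assuming AQL-compatible semantics):

LET obj = { a: 1, b: 2 }
RETURN {
    keys: KEYS(obj, true),           // ["a", "b"]
    merged: MERGE(obj, { c: 3 }),    // { a: 1, b: 2, c: 3 }
    zipped: ZIP(["x", "y"], [1, 2])  // { x: 1, y: 2 }
}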

Bad user agent

Currently, we generate a random user agent for each document.
The problem is that the underlying library sometimes gives us the user agent string of an old browser, which might not be supported by the scraped website.

That leads to unpredictable results like this:

(screenshot)

We need to narrow down the list of user agents that represent modern browsers.
Maybe we could switch to this library: https://github.com/avct/uasurfer

Update:
Probably, it makes sense to extend the functionality in the following manner:

  • by default, the UA is static
  • passing a string uses it as the new UA value
  • passing * as the UA string enables UA string generation

Looks good?

Go vet fails when building using CircleCI

When I build the project using CircleCI, I'm getting this error from the go vet command:

make build
dep ensure
go vet ./cmd/... ./pkg/...

github.com/{{ORG_NAME}}/{{REPO_NAME}}/pkg/runtime/values_test [github.com/{{ORG_NAME}}/{{REPO_NAME}}/pkg/runtime/values.test]

pkg/runtime/values/array_test.go:266:10: invalid argument s (type *values.Array) for len
pkg/runtime/values/array_test.go:267:8: invalid operation: s[0] (type *values.Array does not support indexing)
pkg/runtime/values/array_test.go:271:10: invalid argument s2 (type *values.Array) for len
Makefile:43: recipe for target 'vet' failed
make: *** [vet] Error 2
Exited with code 2

Any idea why?

Open new pages in incognito

Currently, ferret opens pages in regular mode, which makes things more difficult since it uses all of the user's session cookies.
We need to change this and open pages in incognito mode.

Read FQL from stdin

It would be great to be able to read a query from STDIN.

e.g.

echo "some query" | go run ./cmd/cli/main.go

or

cat someQueryFile.fql | go run ./cmd/cli/main.go

or

go run ./cmd/cli/main.go < someQueryFile.fql

Add ternary operator

u.age > 15 || u.active == true ? u.userId : null

There is also a shortcut variant of the ternary operator with just two operands. This variant can be used when the expression for the boolean condition and the return value should be the same:

u.value ? : 'value is null, 0 or not present'

Searching via a unique attribute like ng-repeat="service in services | filter: specialismFilter"

The TR below is inside a table, and I can't figure out how to address each one. I need to use the ng-repeat attribute, as it's the only thing that is unique to them. class="ng-scope" is useless because it appears everywhere on the page.

So how can I address a unique element by an attribute like "ng-repeat"?

<tr ng-repeat="service in services | filter: specialismFilter " class="ng-scope">
     <td class="ng-binding">Renewable Technology</td>
     <td><input type="checkbox" ng-model="service.checked" ng-change="specialistsChecked(service)" ng-disabled="specialistsSelected == checkedLimit &amp;&amp; !service.checked" class="ng-pristine ng-untouched ng-valid ng-empty"></td>
</tr>
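
Since ng-repeat is an attribute, a standard CSS attribute selector should work here; a sketch (the URL is a placeholder, assuming ELEMENTS accepts attribute selectors):

LET doc = DOCUMENT("https://example.com/services", true)
FOR row IN ELEMENTS(doc, 'tr[ng-repeat]')
    RETURN row.innerText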

Add PDF function

Add a PDF(doc) -> binary function that renders the passed document as a PDF.

Scrapoxy

Hello,

It could be interesting to integrate Scrapoxy (http://scrapoxy.io).

You can have your own distributed network of proxies.

A nice combined feature would be the ability to run a scenario through a specific proxy of the mesh (you select the output node so that the scenario runs from the same outside IP).

Best regards,
Fabien.

param test broken?

Describe the bug
It appears that line 47 might be broken in pkg/runtime/core/param_test.go.

pf, err = core.ParamsFrom(ctx) is being run and the test assumes err is not nil. The results say otherwise :)

Failures:

  * /Users/esell/go/src/github.com/esell/ferret/pkg/runtime/core/param_test.go 
  Line 47:
  Expected '<nil>' to NOT be nil (but it was)!

To Reproduce
Steps to reproduce the behavior:

  1. Go to pkg/runtime/core
  2. Run go test
  3. See error

Expected behavior
to pass

Desktop (please complete the following information):

  • OS: macOS
  • Browser: firefox
  • Version: 62.0.3

More unit tests

We need more unit tests in the runtime package.

  • pkg/runtime/core
  • pkg/runtime/expressions
  • pkg/runtime/expressions/operators
  • pkg/runtime/expressions/literals
  • pkg/runtime/expressions/clauses

Add date functions

  • NOW() -> datetime
  • DATE(str) → datetime
  • DATE_DAYOFWEEK(date) → weekdayNumber
  • DATE_YEAR(date) → year
  • DATE_MONTH(date) → month
  • DATE_DAY(date) → day
  • DATE_HOUR(date) → hour
  • DATE_MINUTE(date) → minute
  • DATE_SECOND(date) → second
  • DATE_MILLISECOND(date) → millisecond
  • DATE_DAYOFYEAR(date) → dayOfYear
  • DATE_LEAPYEAR(date) → leapYear
  • DATE_QUARTER(date) → quarter
  • DATE_DAYS_IN_MONTH(date) → daysInMonth
  • DATE_FORMAT(date, format) → str
  • DATE_ADD(date, amount, unit) → datetime
  • DATE_SUBTRACT(date, amount, unit) → datetime
  • DATE_DIFF(date1, date2, unit, asFloat) → diff
  • DATE_COMPARE(date1, date2, unitRangeStart, unitRangeEnd) → bool
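
A rough usage sketch (assuming AQL-compatible semantics; the "day" unit string is an assumption):

LET now = NOW()
RETURN {
    year: DATE_YEAR(now),
    quarter: DATE_QUARTER(now),
    tomorrow: DATE_ADD(now, 1, "day")
}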

Add array comparison operators

[ 1, 2, 3 ] ALL IN [ 2, 3, 4 ]   // false
[ 1, 2, 3 ] ALL IN [ 1, 2, 3 ]   // true
[ 1, 2, 3 ] NONE IN [ 3 ]        // false
[ 1, 2, 3 ] NONE IN [ 23, 42 ]   // true
[ 1, 2, 3 ] ANY IN [ 4, 5, 6 ]   // false
[ 1, 2, 3 ] ANY IN [ 1, 42 ]     // true
[ 1, 2, 3 ] ANY == 2             // true
[ 1, 2, 3 ] ANY == 4             // false
[ 1, 2, 3 ] ANY > 0              // true
[ 1, 2, 3 ] ANY <= 1             // true
[ 1, 2, 3 ] NONE < 99            // false
[ 1, 2, 3 ] NONE > 10            // true
[ 1, 2, 3 ] ALL > 2              // false
[ 1, 2, 3 ] ALL > 0              // true
[ 1, 2, 3 ] ALL >= 3             // false
["foo", "bar"] ALL != "moo"      // true
["foo", "bar"] NONE == "bar"     // false
["foo", "bar"] ANY == "foo"      // true

distributed

Lots of servers block you if you make too many requests at a time.

So the solution is either to put a wait in or to run the script from many cloud servers.
I like the second option.

It would mean that each one runs in a command-and-control fashion, I think:
a central brain controls them all, telling them the exact next step, but each shares the same cookies etc. for authenticated scraping.

Raising this as an idea...
