
The unix-way web crawler

License: MIT License

Languages: Go 98.78%, Makefile 1.22%
Topics: cli, crawler, golang-application, unix-way, web-scraping, web-spider, golang, web-crawler, pentest-tool, go, pentest, pentesting

crawley's Introduction


crawley

Crawls web pages and prints any link it can find.

features

  • fast html SAX-parser (powered by x/net/html)
  • js/css lexical parsers (powered by tdewolff/parse) - extract api endpoints from js code and url() properties from css
  • small (below 1500 SLOC), idiomatic, 100% test-covered codebase
  • grabs most useful resource urls (pics, videos, audio, forms, etc.)
  • found urls are streamed to stdout and guaranteed to be unique (with fragments omitted)
  • scan depth (limited by starting host and path, 0 by default) can be configured
  • can be polite - follows crawl rules and sitemaps from robots.txt
  • brute mode - scan html comments for urls (this can lead to bogus results)
  • makes use of the HTTP_PROXY / HTTPS_PROXY environment variables and handles proxy auth (use HTTP_PROXY="socks5://127.0.0.1:1080/" crawley for socks5)
  • directory-only scan mode (aka fast-scan)
  • user-defined cookies, in curl-compatible format (i.e. -cookie "ONE=1; TWO=2" -cookie "ITS=ME" -cookie @cookie-file)
  • user-defined headers, same as curl: -header "ONE: 1" -header "TWO: 2" -header @headers-file
  • tag filter - specify which tags to crawl (single: -tag a -tag form, multiple: -tag a,form, or mixed)
  • url ignore - skip urls containing the given substrings (i.e.: -ignore logout)
  • subdomains support - allow depth crawling of subdomains as well (e.g. crawley http://some-test.site will also be able to crawl http://www.some-test.site)

examples

# print all links from first page:
crawley http://some-test.site

# print all js files and api endpoints:
crawley -depth -1 -tag script -js http://some-test.site

# print all endpoints from js:
crawley -js http://some-test.site/app.js

# download all png images from site:
crawley -depth -1 -tag img http://some-test.site | grep '\.png$' | wget -i -

# fast directory traversal:
crawley -headless -delay 0 -depth -1 -dirs only http://some-test.site
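
A few more combinations, sketched from the flags described above (the host name and cookie value below are placeholders, not from the project docs):

# crawl through a socks5 proxy with a session cookie, collecting only links and forms:
HTTP_PROXY="socks5://127.0.0.1:1080/" crawley -depth -1 -tag a,form -cookie "SESSION=1" http://some-test.site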

installation

  • binaries / deb / rpm for Linux, FreeBSD, macOS and Windows.
  • Arch Linux: use your favourite AUR helper, e.g. paru -S crawley-bin.
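
If a Go toolchain is available, installing from source should also work; the command below assumes the module layout shown in the stack traces later on this page (cmd/crawley):

# build and install the latest tagged version from source:
go install github.com/s0rg/crawley/cmd/crawley@latest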

usage

crawley [flags] url

possible flags with default values:

-all
    scan all known sources (js/css/...)
-brute
    scan html comments
-cookie value
    extra cookies for request, can be used multiple times, accept files with '@'-prefix
-css
    scan css for urls
-delay duration
    per-request delay (0 - disable) (default 150ms)
-depth int
    scan depth (set -1 for unlimited)
-dirs string
    policy for non-resource urls: show / hide / only (default "show")
-header value
    extra headers for request, can be used multiple times, accept files with '@'-prefix
-headless
    disable pre-flight HEAD requests
-ignore value
    patterns (in urls) to be ignored in crawl process
-js
    scan js code for endpoints
-proxy-auth string
    credentials for proxy: user:password
-robots string
    policy for robots.txt: ignore / crawl / respect (default "ignore")
-silent
    suppress info and error messages in stderr
-skip-ssl
    skip ssl verification
-subdomains
    support subdomains (e.g. if www.domain.com found, recurse over it)
-tag value
    tags filter, single or comma-separated tag names
-timeout duration
    request timeout (min: 1 second, max: 10 minutes) (default 5s)
-user-agent string
    user-agent string
-version
    show version
-workers int
    number of workers (default - number of CPU cores)
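
As a rough sketch of how several of these flags combine (the target host is a placeholder), a slower, politer crawl might look like:

# respect robots.txt, add a longer per-request delay, limit depth, keep stderr quiet:
crawley -robots respect -delay 500ms -depth 2 -silent http://some-test.site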

flags autocompletion

Crawley supports flag autocompletion in bash and zsh via complete:

complete -C "/full-path-to/bin/crawley" crawley
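
In zsh, the bash-style complete builtin is only available after bashcompinit has been loaded, so (as an assumption about your shell setup, not something from the project docs) you may need:

# zsh only: enable bash-compatible completion before registering crawley
autoload -U +X bashcompinit && bashcompinit
complete -C "/full-path-to/bin/crawley" crawley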

license

MIT License (see the FOSSA status badge in the repository for dependency license details).

crawley's People

Contributors

dependabot[bot], fossabot, juxuanu, marybonilla2231, s0rg, thenbe


crawley's Issues

Support ignoring URL params

Add a flag to avoid scraping the same URL when only its query params differ.
For example, while scraping a website https://abc.com, the flag would prevent scraping both https://abc.com/something.php?lang=en and https://abc.com/something.php?lang=ru, since they are the same page with different params.

Thanks!
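
Until such a flag exists, one possible stop-gap (an output-side workaround, not a crawley feature) is to strip query strings and deduplicate with standard tools:

# drop query strings and keep each page only once:
crawley -depth -1 https://abc.com | sed 's/?.*$//' | sort -u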

cannot parse string as cookie

I tried several commands:

  • (with spaces) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX; phpbb3_ddu4final_sid=XXXXX; phpbb3_ddu4final_u=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (without spaces) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX;phpbb3_ddu4final_sid=XXXXX;phpbb3_ddu4final_u=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (single value) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (separated values) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX;" -cookie "phpbb3_ddu4final_sid=XXXXX;" -cookie "phpbb3_ddu4final_u=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (without other parameters) crawley -cookie "phpbb3_ddu4final_k=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt

Every time, the result is "cannot parse the string as cookie":
2023/03/29 23:36:18 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/29 23:36:18 [*] crawling url: https://ddunlimited.net/viewtopic.php?p=5018859
2023/03/29 23:36:18 cannot parse 'phpbb3_ddu4final_k=XXXXX; phpbb3_ddu4final_sid=XXXXX; phpbb3_ddu4final_u=XXXXX;' as cookie, expected format: 'key=value;' as in curl
2023/03/29 23:36:20 [*] complete

It works when I don't use semicolon:

  • (single value) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt

Crawley v1.5.12-a1f6de2 (archlinux)
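
Judging from the error message and the one case that did work, the parser seems to want plain key=value pairs without a trailing semicolon; one form that should parse (cookie names and values are the reporter's placeholders) is:

# one -cookie flag per cookie, no trailing ';':
crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX" -cookie "phpbb3_ddu4final_sid=XXXXX" -cookie "phpbb3_ddu4final_u=XXXXX" "https://ddunlimited.net/viewtopic.php?p=5018859"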

Just a suggestion

  1. For the latest release, the current download URL for a binary is:
    https://github.com/s0rg/crawley/releases/latest/download/crawley_v1.7.7_$(uname)_$(uname -m).tar.gz
    If possible, please rename the release files from "crawley_v1.7.7_$(uname)_$(uname -m).tar.gz" to "crawley_$(uname)_$(uname -m).tar.gz" (without the version), so the latest build can always be fetched from a single URL, e.g.:
    wget https://github.com/s0rg/crawley/releases/latest/download/crawley_$(uname)_$(uname -m).tar.gz

  2. Also document "go install -v github.com/s0rg/crawley/cmd/crawley@latest" as a clean, package-path-based install option.

panic: runtime error: invalid memory address or nil pointer dereference

Command: ./crawley -all -user-agent "Mozilla/5.0" -subdomains -headless -depth -1 -silent -skip-ssl -workers 50 -timeout 10s -robots crawl https://target.tld

ERROR:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5778b2]

goroutine 44 [running]:
net/url.(*URL).ResolveReference(0xc0004b2750, 0x0)
/opt/hostedtoolcache/go/1.22.3/x64/src/net/url/url.go:1087 +0x32
github.com/s0rg/crawley/internal/crawler.(*Crawler).process.func1({0xc0005e2000, 0x556d})
/home/runner/work/crawley/crawley/internal/crawler/crawler.go:332 +0xd2
github.com/s0rg/crawley/internal/links.ExtractCSS({0x7f05a3350028?, 0xc00068e680?}, 0xc0002c9de0)
/home/runner/work/crawley/crawley/internal/links/css.go:26 +0x65
github.com/s0rg/crawley/internal/crawler.(*Crawler).process(0xc0000b2780, {0x752b20?, 0xc0004ca4d0?}, {0x751a08?, 0xc00013a000?}, 0xc0008b6900, {0xc00077c500, 0x59})
/home/runner/work/crawley/crawley/internal/crawler/crawler.go:355 +0x431
github.com/s0rg/crawley/internal/crawler.(*Crawler).worker(0xc0000b2780, {0x751a08, 0xc00013a000})
/home/runner/work/crawley/crawley/internal/crawler/crawler.go:390 +0x496
created by github.com/s0rg/crawley/internal/crawler.(*Crawler).Run in goroutine 1
/home/runner/work/crawley/crawley/internal/crawler/crawler.go:104 +0x31a

Deadlock error

I want to scrape the forum www.invitehawk.com but I get this deadlock error. I know it's a cookie error; in this case I can scrape even without cookies, but I'd like to know the correct cookie syntax. I'm using the Netscape cookie format.

crawley --cookie invitehawk_ck.txt -depth -1 -dirs only "https://www.invitehawk.com/topic/126147-tracker-review-index-sorted-by-category/" | grep -E '.*20[0-9][0-9]-review' > url.list.ih.txt

2023/03/19 10:17:11 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/19 10:17:11 [*] crawling url: https://www.invitehawk.com/topic/126147-tracker-review-index-sorted-by-category/
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0x0?)
	/opt/hostedtoolcache/go/1.20.0/x64/src/runtime/sema.go:62 +0x27
sync.(*WaitGroup).Wait(0xc0000b01e0?)
	/opt/hostedtoolcache/go/1.20.0/x64/src/sync/waitgroup.go:116 +0x4b
github.com/s0rg/crawley/pkg/crawler.(*Crawler).close(0xc0000b2840)
	/home/runner/work/crawley/crawley/pkg/crawler/crawler.go:201 +0x65
panic({0x6c1960, 0xc0000c20c0})
	/opt/hostedtoolcache/go/1.20.0/x64/src/runtime/panic.go:884 +0x213
github.com/s0rg/crawley/pkg/client.parseOne({0x7ffc5e21c176?, 0x11?})
	/home/runner/work/crawley/crawley/pkg/client/cookie.go:35 +0x13d
github.com/s0rg/crawley/pkg/client.prepareCookies({0xc00009e430?, 0x1, 0xc000170000?})
	/home/runner/work/crawley/crawley/pkg/client/cookie.go:17 +0x13c
github.com/s0rg/crawley/pkg/client.New({0xc0000cc280, 0x3e}, 0x4, 0x0, {0xc00009e440, 0x1, 0x1}, {0xc00009e430, 0x1, 0x1})
	/home/runner/work/crawley/crawley/pkg/client/http.go:57 +0x1f8
github.com/s0rg/crawley/pkg/crawler.(*Crawler).Run(0xc0000b2840, {0x7ffc5e21c19d, 0x50}, 0x70ee68)
	/home/runner/work/crawley/crawley/pkg/crawler/crawler.go:99 +0x29b
main.crawl({0x7ffc5e21c19d, 0x50}, {0xc0000b4200?, 0x0?, 0xc0000406c8?})
	/home/runner/work/crawley/crawley/cmd/crawley/main.go:94 +0xe5
main.main()
	/home/runner/work/crawley/crawley/cmd/crawley/main.go:235 +0x188
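
Per the -cookie flag description above, file arguments need the '@' prefix (and the file contents may still need to be curl-style key=value pairs rather than Netscape format), so a form worth trying is:

# pass the cookie file with the '@' prefix:
crawley -cookie @invitehawk_ck.txt -depth -1 -dirs only "https://www.invitehawk.com/topic/126147-tracker-review-index-sorted-by-category/"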

cookies loaded but login page detected

Hello, when I use this command:
crawley -depth -1 -dirs only -cookie "phpbb3_XXXXXX_sid=XXXXXX; phpbb3_XXXXXX_u=XXXXXX; phpbb3_XXXXXX_k=XXXXXX;" "https://XXXXXX.net/viewtopic.php?p=5018859" > urls.txt
crawley scrapes the login page of the forum instead of the thread I selected, and it doesn't return any errors:

2023/03/20 16:08:58 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/20 16:08:58 [*] crawling url: https://XXXXXX.net/viewtopic.php?p=5018859
2023/03/20 16:09:00 [*] complete

I tried different user agents and headers; every time the result is the same: the login page of the forum. The cookies are fine, I copied them using the EditThisCookie Chrome extension, and I'm not using any VPN/proxy.
I also tested other forums, and the result is always the same: I can't log in.
Do you know if there is a problem loading cookies?
