
urlgrab's Introduction

Welcome to urlgrab 👋

Twitter: DevinStokes

A Go utility that spiders through a website searching for additional links, with support for JavaScript rendering.

Install

go get -u github.com/iamstoxe/urlgrab

Features

  • Customizable Parallelism
  • Ability to Render JavaScript (including Single Page Applications such as Angular and React)

Usage

Usage of urlgrab:
  -cache-dir string
        Specify a directory to utilize for caching. Works between sessions as well.
  -debug
        Extremely verbose debugging output. Useful mainly for development.
  -delay int
        Milliseconds to randomly apply as a delay between requests. (default 2000)
  -depth int
        The maximum limit on the recursion depth of visited URLs.  (default 2)
  -headless
        If true the browser will run headless (not displayed) while crawling.
        Note: Requires the render-js flag
        Note: To show the browser, pass --headless=false (default true)
  -ignore-query
        Strip the query portion of the URL before determining if we've visited it yet.
  -ignore-ssl
        Scrape pages with invalid SSL certificates
  -js-timeout int
        The number of seconds before a request to render JavaScript should time out. (default 10)
  -json string
        The filename where we should store the output JSON file.
  -max-body int
        The limit of the retrieved response body in kilobytes.
        0 means unlimited.
        Supply this value in kilobytes (e.g. 10 * 1024 KB = 10 MB). (default 10240)
  -no-head
        Do not send HEAD requests prior to GET for pre-validation.
  -output-all string
        The directory where we should store the output files.
  -proxy string
        The proxy to utilize (format: socks5://127.0.0.1:8080 or http://127.0.0.1:8080).
        Supply multiple proxies by separating them with a comma.
  -random-agent
        Utilize a random user agent string.
  -render-js
        Determines if we utilize a headless Chrome instance to render JavaScript.
  -root-domain string
        The root domain we should match links against.
        If not specified it will default to the host of --url.
        Example: --root-domain google.com
  -threads int
        The number of threads to utilize. (default 5)
  -timeout int
        The number of seconds before a request should time out. (default 10)
  -url string
        The URL where we should start crawling.
  -urls string
        A file path that contains a list of URLs to supply as starting URLs.
        Requires --root-domain flag.
  -user-agent string
        A user agent such as (Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0).
  -verbose
        Verbose output
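
For example, a typical crawl with JavaScript rendering enabled might look like the following (the target URL and output filename are illustrative):

urlgrab -url https://example.com -render-js -threads 10 -depth 3 -json results.json

A list of starting URLs can also be supplied from a file, in which case the root-domain flag is required:

urlgrab -urls targets.txt -root-domain example.com -random-agent -ignore-ssl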

Build

You can easily build a binary specific to your platform into the bin directory with the following command:

make build

If you want to make binaries for Windows, Linux and macOS to distribute the CLI, just run this command:

make cross

All the binaries will be available in the dist directory.

Author

👤 Devin Stokes

๐Ÿค Contributing

Contributions, issues and feature requests are welcome!
Feel free to check the issues page.

Show your support

Give a ⭐️ if this project helped you!

Buy Me A Coffee

urlgrab's People

Contributors

glours, iamstoxe, jay51

urlgrab's Issues

Cookie Support

It would be great if you could specify cookies that are used for each request in urlgrab.
Maybe even have a session that adds new cookies along the way (for CSRF nonces, etc.).
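
If urlgrab's page collector is built on colly (the "[Page Collector]" log prefix suggests a colly Collector, but this is an assumption), cookie support could likely piggyback on colly's Collector.SetCookies. A minimal sketch, with a placeholder cookie and target URL:

	package main

	import (
		"log"
		"net/http"

		"github.com/gocolly/colly/v2"
	)

	func main() {
		c := colly.NewCollector()

		// Hypothetical example: attach a session cookie to every request
		// the collector makes to this host.
		cookies := []*http.Cookie{
			{Name: "session", Value: "placeholder-value"},
		}
		if err := c.SetCookies("https://example.com", cookies); err != nil {
			log.Fatal(err)
		}

		if err := c.Visit("https://example.com"); err != nil {
			log.Fatal(err)
		}
	}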

Find a way to output information more concisely.

There are many facets to the information gathered here, not limited to:

  1. The total number of URLs found linked within the crawled sites
  2. The total number of discovered URLs that belong to the root domain of our starting URL
  3. The count of occurrences for both items above.

There must be a way to output this information so that it is useful to all users while only emitting what each user wants. This may require reworking the CLI (possibly adopting a CLI framework such as cobra; a rough sketch follows below).

This will require thought. For now I will see what I can do with the existing system.
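
As a rough illustration only (the subcommand and flag names below are hypothetical, not the current interface), a cobra-based rework could separate crawling from reporting so each command only prints what its users care about:

	package main

	import (
		"fmt"
		"os"

		"github.com/spf13/cobra"
	)

	func main() {
		var jsonOut string

		rootCmd := &cobra.Command{
			Use:   "urlgrab",
			Short: "Spider a website for additional links",
		}

		// Hypothetical subcommand: crawl a single starting URL.
		crawlCmd := &cobra.Command{
			Use:   "crawl [url]",
			Short: "Crawl a starting URL and report discovered links",
			Args:  cobra.ExactArgs(1),
			RunE: func(cmd *cobra.Command, args []string) error {
				fmt.Printf("crawling %s, writing JSON to %s\n", args[0], jsonOut)
				return nil
			},
		}
		crawlCmd.Flags().StringVar(&jsonOut, "json", "", "file to store the output JSON")

		rootCmd.AddCommand(crawlCmd)
		if err := rootCmd.Execute(); err != nil {
			os.Exit(1)
		}
	}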

Getting Error [Page Collector] Max Depth Reached

Seeing the following error:

โฏ urlgrab -headless -url fool.com
09:54:45.970 โ–  main โ–ถ INFO Domain: fool.com
09:54:47.890 โ–  func3 โ–ถ ERROR [Page Collector] Max Depth Reached
09:54:47.890 โ–  func3 โ–ถ ERROR [Page Collector] Max Depth Reached
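
For context, the default -depth is 2, so this message most likely just reports that the crawler stopped following links at that limit rather than a fatal error; if so, raising the depth should change the behavior, e.g.:

urlgrab -headless -url fool.com -depth 5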

License?

Respected author,

Please add a license.

Kind regards,
Milan

go get error (unrecognized import path)

Hi :D
As described in the README, I tried to install through go get, but the following error occurs.

Case 1
OS: macOS 10.15.5
Go version: go1.15 darwin/amd64

Case 2
OS: Ubuntu 20.04
Go version: go1.13.8 linux/amd64

Same result in both cases:

$ go get -u github.com/iamstoxe/urlgrab
package urlgrab/browser: unrecognized import path "urlgrab/browser": import path does not begin with hostname
package urlgrab/utilities: unrecognized import pat

The problem is the use of local package import paths. Is there any reason for configuring them locally?

	"urlgrab/browser"
	. "urlgrab/utilities"
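
For reference, the usual fix for this kind of error is to declare the repository as a Go module and import the subpackages by their full module path, roughly like the following (assuming the module path github.com/iamstoxe/urlgrab):

go.mod:

	module github.com/iamstoxe/urlgrab

	go 1.15

main.go:

	import (
		"github.com/iamstoxe/urlgrab/browser"
		. "github.com/iamstoxe/urlgrab/utilities"
	)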
