projectdiscovery / katana Goto Github PK

View Code? Open in Web Editor NEW

8.7K 77.0 462.0 1.44 MB

A next-generation crawling and spidering framework.

License: MIT License

Go 99.32% Dockerfile 0.14% Makefile 0.18% Shell 0.36%

crawler web-spider gocrawler spider-framework cli headless

katana's People

Contributors

Stargazers

Watchers

Forkers

gprime31 onepiece-max bahlawi89 akbruster imlaoji ikpehlivan techris45 amirwhitehat meaebi magnologan 5l1v3r1 hxlxmjxbbxs afwu araselmir orleven hackerxj007 root4rce lanzsec munjurhasan444 in1t0 changheluor007 nesiht aidasdir nck0099 rlima999 hung mschader spongyb onchere evcuq4hggjd74lhz cr4ck32 knightcn1983 sajjadanwar0 0xrobert datnt78 madusec righteousgambit dekeeu saif-programmer nocomp im-hanzou dayutry momomasilia pranay7700 sec-fork ziduhuihai lalkaltest longlongji raindouble explangcn cviorel xhardc0re atlassion llabi nanderoo jrcribb calebalem tehmasta busterdomo olaf88 richardsonjf cheungyoung jinbinhan sala2000 emptysec1 feliperochadev illiagoldin tomb-rider migzone liansweb testitok 0610thomas silocityit anylayer rudycomcast irwpb exiahan uakbr noscripter amiimor kugoucode william3johnson inferno-inc pavandeore weisk marcosantonastasi mznsolucoes reynoldsm88 pent thearchiver yonasbsd ericdunham2 mvjq zzhsec forforkssake joooox1 gavinljj jgui1129 zhangkaiitugithub spread0x

katana's Issues

STDIN URL input support

Currently, input can be supplied with -u or -list option that can be extended to support stdin as well.

echo https://www.hackerone.com | ./katana 

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1							 

		projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
[FTL] Could not process: could not create runner: could not validate options: no inputs specified for crawler

Implement headless crawler mechanism

Description

Implement the crawler code based on the Design accepted by the #4 issue.

Project Version

dev

Please describe your feature request:

Regex improvements for endpoint extraction.

echo https://projectdiscovery.io | ./katana -jc -cs projectdiscovery.io

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1							 

		projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
https://projectdiscovery.io/app.js
https://projectdiscovery.io/
https://projectdiscovery.io/moment.js
https://projectdiscovery.io/Underscore.js
https://projectdiscovery.io/a/i
https://projectdiscovery.io/a/b
https://projectdiscovery.io/e.do
https://projectdiscovery.io/n.do
https://projectdiscovery.io/af
https://projectdiscovery.io/af.js
https://projectdiscovery.io/ar
https://projectdiscovery.io/ar-dz
https://projectdiscovery.io/ar-dz.js
https://projectdiscovery.io/ar-kw
https://projectdiscovery.io/ar-kw.js
https://projectdiscovery.io/ar-ly
https://projectdiscovery.io/ar-ly.js
https://projectdiscovery.io/ar-ma
https://projectdiscovery.io/ar-ma.js
https://projectdiscovery.io/ar-sa
https://projectdiscovery.io/ar-sa.js
https://projectdiscovery.io/ar-tn
https://projectdiscovery.io/ar-tn.js
https://projectdiscovery.io/ar.js
https://projectdiscovery.io/az
https://projectdiscovery.io/az.js
https://projectdiscovery.io/be
https://projectdiscovery.io/be.js
https://projectdiscovery.io/bg
https://projectdiscovery.io/bg.js
https://projectdiscovery.io/bm
https://projectdiscovery.io/bm.js
https://projectdiscovery.io/bn
https://projectdiscovery.io/bn-bd
https://projectdiscovery.io/bn-bd.js
https://projectdiscovery.io/bn.js
https://projectdiscovery.io/bo
https://projectdiscovery.io/bo.js
https://projectdiscovery.io/br
https://projectdiscovery.io/br.js
https://projectdiscovery.io/bs
https://projectdiscovery.io/bs.js
https://projectdiscovery.io/ca
https://projectdiscovery.io/ca.js
https://projectdiscovery.io/cs
https://projectdiscovery.io/cs.js
https://projectdiscovery.io/cv
https://projectdiscovery.io/cv.js
https://projectdiscovery.io/cy
https://projectdiscovery.io/cy.js
https://projectdiscovery.io/da
https://projectdiscovery.io/da.js
https://projectdiscovery.io/de
https://projectdiscovery.io/de-at
https://projectdiscovery.io/de-at.js
https://projectdiscovery.io/de-ch
https://projectdiscovery.io/de-ch.js
https://projectdiscovery.io/de.js
https://projectdiscovery.io/dv
https://projectdiscovery.io/dv.js
https://projectdiscovery.io/el
https://projectdiscovery.io/el.js
https://projectdiscovery.io/en-au
https://projectdiscovery.io/en-au.js
https://projectdiscovery.io/en-ca
https://projectdiscovery.io/en-ca.js
https://projectdiscovery.io/en-gb
https://projectdiscovery.io/en-gb.js
https://projectdiscovery.io/en-ie
https://projectdiscovery.io/en-ie.js
https://projectdiscovery.io/en-il
https://projectdiscovery.io/en-il.js
https://projectdiscovery.io/en-in
https://projectdiscovery.io/en-in.js
https://projectdiscovery.io/en-nz
https://projectdiscovery.io/en-nz.js
https://projectdiscovery.io/en-sg
https://projectdiscovery.io/en-sg.js
https://projectdiscovery.io/eo
https://projectdiscovery.io/eo.js
https://projectdiscovery.io/es
https://projectdiscovery.io/es-do
https://projectdiscovery.io/es-do.js
https://projectdiscovery.io/es-mx
https://projectdiscovery.io/es-mx.js
https://projectdiscovery.io/es-us
https://projectdiscovery.io/es-us.js
https://projectdiscovery.io/es.js
https://projectdiscovery.io/et
https://projectdiscovery.io/et.js
https://projectdiscovery.io/eu
https://projectdiscovery.io/eu.js
https://projectdiscovery.io/fa
https://projectdiscovery.io/fa.js
https://projectdiscovery.io/fi
https://projectdiscovery.io/fi.js
https://projectdiscovery.io/fil
https://projectdiscovery.io/fil.js
https://projectdiscovery.io/fo
https://projectdiscovery.io/fo.js
https://projectdiscovery.io/fr
https://projectdiscovery.io/fr-ca
https://projectdiscovery.io/fr-ca.js
https://projectdiscovery.io/fr-ch
https://projectdiscovery.io/fr-ch.js
https://projectdiscovery.io/fr.js
https://projectdiscovery.io/fy
https://projectdiscovery.io/fy.js
https://projectdiscovery.io/ga
https://projectdiscovery.io/ga.js
https://projectdiscovery.io/gd
https://projectdiscovery.io/gd.js
https://projectdiscovery.io/gl
https://projectdiscovery.io/gl.js
https://projectdiscovery.io/gom-deva
https://projectdiscovery.io/gom-deva.js
https://projectdiscovery.io/gom-latn
https://projectdiscovery.io/gom-latn.js
https://projectdiscovery.io/gu
https://projectdiscovery.io/gu.js
https://projectdiscovery.io/he
https://projectdiscovery.io/he.js
https://projectdiscovery.io/hi
https://projectdiscovery.io/hi.js
https://projectdiscovery.io/hr
https://projectdiscovery.io/hr.js
https://projectdiscovery.io/hu
https://projectdiscovery.io/hu.js
https://projectdiscovery.io/hy-am
https://projectdiscovery.io/hy-am.js
https://projectdiscovery.io/id
https://projectdiscovery.io/id.js
https://projectdiscovery.io/is
https://projectdiscovery.io/is.js
https://projectdiscovery.io/it
https://projectdiscovery.io/it-ch
https://projectdiscovery.io/it-ch.js
https://projectdiscovery.io/it.js
https://projectdiscovery.io/ja
https://projectdiscovery.io/ja.js
https://projectdiscovery.io/jv
https://projectdiscovery.io/jv.js
https://projectdiscovery.io/ka
https://projectdiscovery.io/ka.js
https://projectdiscovery.io/kk
https://projectdiscovery.io/kk.js
https://projectdiscovery.io/km
https://projectdiscovery.io/km.js
https://projectdiscovery.io/kn
https://projectdiscovery.io/kn.js
https://projectdiscovery.io/ko
https://projectdiscovery.io/ko.js
https://projectdiscovery.io/ku
https://projectdiscovery.io/ku.js
https://projectdiscovery.io/ky
https://projectdiscovery.io/ky.js
https://projectdiscovery.io/lb
https://projectdiscovery.io/lb.js
https://projectdiscovery.io/lo
https://projectdiscovery.io/lo.js
https://projectdiscovery.io/lt
https://projectdiscovery.io/lt.js
https://projectdiscovery.io/lv
https://projectdiscovery.io/lv.js
https://projectdiscovery.io/me
https://projectdiscovery.io/me.js
https://projectdiscovery.io/mi
https://projectdiscovery.io/mi.js
https://projectdiscovery.io/mk
https://projectdiscovery.io/mk.js
https://projectdiscovery.io/ml
https://projectdiscovery.io/ml.js
https://projectdiscovery.io/mn
https://projectdiscovery.io/mn.js
https://projectdiscovery.io/mr
https://projectdiscovery.io/mr.js
https://projectdiscovery.io/ms
https://projectdiscovery.io/ms-my
https://projectdiscovery.io/ms-my.js
https://projectdiscovery.io/ms.js
https://projectdiscovery.io/mt
https://projectdiscovery.io/mt.js
https://projectdiscovery.io/my
https://projectdiscovery.io/my.js
https://projectdiscovery.io/nb
https://projectdiscovery.io/nb.js
https://projectdiscovery.io/ne
https://projectdiscovery.io/ne.js
https://projectdiscovery.io/nl
https://projectdiscovery.io/nl-be
https://projectdiscovery.io/nl-be.js
https://projectdiscovery.io/nl.js
https://projectdiscovery.io/nn
https://projectdiscovery.io/nn.js
https://projectdiscovery.io/oc-lnc
https://projectdiscovery.io/oc-lnc.js
https://projectdiscovery.io/pa-in
https://projectdiscovery.io/pa-in.js
https://projectdiscovery.io/pl
https://projectdiscovery.io/pl.js
https://projectdiscovery.io/pt
https://projectdiscovery.io/pt-br
https://projectdiscovery.io/pt-br.js
https://projectdiscovery.io/pt.js
https://projectdiscovery.io/ro
https://projectdiscovery.io/ro.js
https://projectdiscovery.io/ru
https://projectdiscovery.io/ru.js
https://projectdiscovery.io/sd
https://projectdiscovery.io/sd.js
https://projectdiscovery.io/se
https://projectdiscovery.io/se.js
https://projectdiscovery.io/si
https://projectdiscovery.io/si.js
https://projectdiscovery.io/sk
https://projectdiscovery.io/sk.js
https://projectdiscovery.io/sl
https://projectdiscovery.io/sl.js
https://projectdiscovery.io/sq
https://projectdiscovery.io/sq.js
https://projectdiscovery.io/sr
https://projectdiscovery.io/sr-cyrl
https://projectdiscovery.io/sr-cyrl.js
https://projectdiscovery.io/sr.js
https://projectdiscovery.io/ss
https://projectdiscovery.io/ss.js
https://projectdiscovery.io/sv
https://projectdiscovery.io/sv.js
https://projectdiscovery.io/sw
https://projectdiscovery.io/sw.js
https://projectdiscovery.io/ta
https://projectdiscovery.io/ta.js
https://projectdiscovery.io/te
https://projectdiscovery.io/te.js
https://projectdiscovery.io/tet
https://projectdiscovery.io/tet.js
https://projectdiscovery.io/tg
https://projectdiscovery.io/tg.js
https://projectdiscovery.io/th
https://projectdiscovery.io/th.js
https://projectdiscovery.io/tk
https://projectdiscovery.io/tk.js
https://projectdiscovery.io/tl-ph
https://projectdiscovery.io/tl-ph.js
https://projectdiscovery.io/tlh
https://projectdiscovery.io/tlh.js
https://projectdiscovery.io/tr
https://projectdiscovery.io/tr.js

Please describe your feature request:

katana/pkg/utils/formfill.go

Lines 21 to 26 in 3c543ad

	var DefaultFormFillData = FormFillData{
	Email: "[email protected]",
	Color: "#e66465",
	Password: "katanaP@assw0rd1",
	PhoneNumber: "2124567890",
	Placeholder: "katana",

CLI Option:

   -fc, form-config string                path to the form configuration file

Custom form config input
Email input randomization by default

(2) Could not request seed URL - context deadline exceeded (timeout)

katana -u https://44.199.9.133/ -d 10 -sjr -is

Error :

[ERR] Could not request seed URL: GET http://44.199.9.133/search/ giving up after 2 attempts: Get "http://44.199.9.133/search/": context deadline exceeded (Client.Timeout exceeded while awaiting heade

Request timed out because port 80 was not open on the server and katana found http url while crawling while the server had only port 443 open.

Possible Solutions can be auto upgrade to https or skip the url and continue testing

Please describe your feature request:

Similar to f, -sf is a new option that lets us write the values of single or multiple fields into txt file named after the scheme and host and a field key, i.e scheme_host_field_name.txt

CLI Option:

   -sf, -store-field string  field to store in output (fqdn,rdn,url,rurl,path,file,key,value,kv)

sf option is default to write in katana_output directory.

Example:

./katana -u https://example.com -f url -sf fqdn,key,dir

ls katana_output/

https_example.com_fqdn.txt
https_example.com_key.txt
https_example.com_dir.txt

Describe the use case of this feature:

This will allow us to write multiple type of url data into file that can be used further for various automation including

custom wordlist building
common query collection
common key values collection
common path collection
dns data collection and more.

JSON output improvement

Current output:

{
  "url": "https://www.hackerone.com/events/app-security-testing",
  "source": "a"
}

Updated output:

{
  "timestamp": "2022-08-22T04:46:23.405849+05:30"
  "endpoint": "https://www.hackerone.com/events/app-security-testing",  # endpoint is discovered url
  "source": "https://www.hackerone.com/events/"  # source is page url where the endpoint got discovered 
  "tag": "a", 
  "attribute": "href"
}

Event-loop experimental implementation

Add better 404 content detection and found link validation

Basically an auto-callibration feature like ffuf for relative links and other stuff in non-standard crawler

meta links are not parsed correctly

katana version:

main/dev

Example response:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>GETPAID</title>
<meta http-equiv="REFRESH" content="0;url=https://unitedcargobilling.ual.com/ngetpaid"></HEAD>
<BODY>
Redirecting
</BODY>
</HTML>

Current Behavior:

No results.

Expected Behavior:

https://unitedcargobilling.ual.com/ngetpaid should be parsed and crawled.

Steps To Reproduce:

echo https://unitedcargobilling.ual.com | ./katana

Anything else:

Related code:

katana/pkg/standard/parser.go

Lines 275 to 287 in f466d83

    
           func bodyMetaContentTagParser(resp navigationResponse, callback func(navigationRequest)) { 
        
           	resp.Reader.Find("meta[http-equiv='refresh']").Each(func(i int, item *goquery.Selection) { 
        
           		header, ok := item.Attr("content") 
        
           		if !ok { 
        
           			return 
        
           		} 
        
           		values := utils.ParseRefreshTag(header) 
        
           		if values == "" { 
        
           			return 
        
           		} 
        
           		callback(newNavigationRequestURL(values, resp.Resp.Request.URL.String(), "meta", "refresh", resp)) 
        
           	}) 
        
           }

Action item:

Fix
Test

Add xhr, async, etc call detection feature for headless

Please describe your feature request:

Describe the use case of this feature:

Create headless crawler design document

Description

A design document describing the functionality and the requirements of the headless variant of the crawler needs to be created. This will then be used to come up with the actual functionality of the crawler.

Improve body reading + fix hang on websocket 101 connect + misc improvements

Please describe your feature request:

Describe the use case of this feature:

Add parsing for OpenAPI, Swagger, etc as seed input

Example - https://github.com/projectdiscovery/sage/tree/main/pkg/formats

Scope to mimic burp suite scope behavior

   -cs, -crawl-scope string[]       in scope target to be followed by crawler
   -cos, -crawl-out-scope string[]  out of scope target to be excluded by crawler

-cs flag accepts word/regex that will be applicable to the URL component.

Investigate go-rod vs playwright-go vs others for crawler development

Description

Go has several headless libraries which are listed below -

https://github.com/go-rod/rod (cons: timeout not working properly, GetElement() stuck in a loop, leakless gets a binary from third party website)
https://github.com/playwright-community/playwright-go
https://github.com/chromedp/chromedp

We should investigate and decide on a library suitable for crawler design.

Metrics to consider:

Repository maintained
Community
Stability
Number of stars
Working correctly under heavy load

Add new option : -path-deny-list / -pdl to exclude paths from crawling

Add a new flag -path-deny-list / -pdl to exclude path(s) from crawling, Can be a list of paths / single path, comma separated path (via command line ), It will be useful for authenticated crawling, Where user doesn't want to make requests to logout paths to avoid cookie invalidation.

Form Bug - GET Request Body

katana version:

v0.0.1

Current Behavior:

Just running katana over my website (pretty basic Wordpress site), https://wya.pl, there is a form to search posts via the /?s= parameter. When I proxy the crawler, I can see that the form is identified and the parameter is filled with the value of katana. However, I can see that the resultant request copies the parameter into the body of the request.

Expected Behavior:

The form submits a GET request normally, so I'd expect for a GET request with the filled out parameter to be only in the query string. Since this is a GET request, I'd expect for there to be an empty HTTP body.

Steps To Reproduce:

Here is what the HTML form looks like (I swapped my site to localhost here to limit spam):

<form role="search" method="get" class="search-form" action="http://localhost/">
	<label>
		<span class="screen-reader-text">Search for:</span>
		<input type="search" class="search-field" placeholder="Search …" value="" name="s" title="Search for:">
	</label>
	<button type="submit" class="search-submit"><span class="screen-reader-text">Search</span></button>
</form>

Anything else:

Add concurrency and parallelism to standard and headless crawler

Implement concurrency and parallelism to both headless and non-headless crawler

Add hidden and bruteforce parameter options for endpoints

Please describe your feature request:

ParamMiner + other parameter discovery tool type integration support

Describe the use case of this feature:

Add Dockerfile

Please describe your feature request:

Dockerize katana, the container must pre-install headless broswer

Use hmap for deduplication instead of map to reduce memory usage

Please describe your feature request:

use hmap for optimized disk and memory hybrid storage

Describe the use case of this feature:

Added various built-in navigation paths for scraping

Please describe your feature request:

robots.txt
sitemap.xml

Fetch these files and parse them to get more endpoints

Describe the use case of this feature:

Create storage crawling graph implementation for headless crawler

Implement a crawl graph with relevant method for navigation and events storage and retrieval

Implement CLI wrapper around non-headless Katana Engine

CLI Client using goflags

Usage:
  ./katana [flags]

Flags:

INPUT:
   -u, -list string[]  target url / list to crawl (single / comma separated / file input)

CONFIGURATIONS:
   -config string                   cli flag configuration file
   -d, -depth                       maximum depth to crawl (default 1)
   -ct, -crawl-duration int         maximum duration to crawl the target for
   -mrs, -max-response-size int     maximum response size to read (default 10 MB)
   -timeout int                     time to wait in seconds before timeout (default 5)
   -p, -proxy string[]              http/socks5 proxy list to use (single / comma separated / file input)
   -H, -header string[]             custom header/cookie to include in request (single / file input)

SCOPE:
   -cs, -crawl-scope string[]       in scope target to be followed by crawler (single / comma separated / file input) # regex input
   -cos, -crawl-out-scope string[]  out of scope target to exclude by crawler (single / comma separated / file input) # regex input
   -is, -include-sub                include subdomains in crawl scope (false)

RATE-LIMIT:
   -c, -concurrency int          number of concurrent fetchers to use  (default 300)
   -rd, -delay int               request delay between each request in seconds (default -1)
   -rl, -rate-limit int          maximum requests to send per second (default 150)
   -rlm, -rate-limit-minute int  maximum number of requests to send per minute

OUTPUT:
   -o, -output string        output file to write
   -json                     write output in JSONL(ines) format (false)
   -nc, -no-color            disable output content coloring (ANSI escape codes) (false)
   -silent                   display output only (false)
   -v, -verbose              display verbose output (false)
   -version                  display project version

Reference:

https://github.com/projectdiscovery/gocrawl
https://github.com/projectdiscovery/katana/tree/backup/pkg/engine/standard (improved)

Implement human-like browser based headless navigation

Please describe your feature request:

Describe the use case of this feature:

Scope syntax Improvements

katana version:

main
dev

Current Behavior:

scope doesn't support the include/exclude options:

host:port
ip
ip:port
:port
cidr

Expected Behavior:

Support the previous syntax

Add framework specific crawling capabilities

Please describe your feature request:

Things like angular,react,etc

Describe the use case of this feature:

Execution context was destroyed

katana version:

dev | master

Current Behavior:

echo http://34.236.11.165 | ./katana -jc -headless -v

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1							 

		projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
[WRN] context canceled
[WRN] context canceled
[WRN] context canceled
[WRN] Could not request seed URL: {-32000 Execution context was destroyed. }

katana scope not working correctly with localhost

Please describe your feature request:

since localhost is invalid domain name without tld, host parsing is not working as expected

Describe the use case of this feature:

Basic headless navigation module

Description

Refactor the code to extract common functionalities and define interface for multiple crawling engines
Add basic headless crawling functionality that in one pass analyses dom-rendered data and intercepts passively all js-generated http requests

Notes: extractors, analyzers and edge cases will be handled as part of #16

Endpoint in JS is not crawled / follwoing depth option

katana version:

main

Current Behavior:

Endpoints in javascript files is not crawled after intial detection / does not follow depth options.

Expected Behavior:

-depth options to be respected for javascript files or any extension.

Steps To Reproduce:

echo https://projectdiscovery.io/app.js | ./katana -sjr | wc 
383

$echo https://projectdiscovery.io/app.js | ./katana -sjr -d 5 | wc 
383

$echo https://projectdiscovery.io/app.js | ./katana -sjr -d 10 | wc 
383

bug in url regex to print non http/s urls

Validate if extracted element is a URL and filter out non http/https slugs from output.

Leakless binary flagged as malicious by Windows Defender

katana version:

dev

Current Behavior:

Leakless binary is flagged as malicious by Windows Deferender

Expected Behavior:

Headless instances cleanup

Steps To Reproduce:

> go run . -cs 127.0.0.1 -u http://127.0.0.1:8000 -headless > head.txt

   __        __
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1

                projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
[FTL] Could not process: could not execute crawling: could not create standard crawler: fork/exec C:\Users\user\AppData\Local\Temp\leakless-0c3354cd58f0813bb5b34ddf3a7c16ed\leakless.exe: Operation did not complete successfully because the file contains a virus or potentially unwanted software.
exit status 1

Notes: partially solved in https://github.com/projectdiscovery/nuclei/blob/1010cca84e62e04cd675debfce20ce96d2e9cd3c/v2/pkg/protocols/headless/engine/engine.go#L158

Command line suggestions

Either set the default value of -cs flag to include only the current domain in crawl scope or add another flag -cscd ( current domain crawl scope) so that katana only crawls current domain.
New parameter : -nqs ( No query string ) : When user doesn't want any query strings in output, Can be useful for further fuzzing purposes. Can be done easily otherwise but will be better if supported natively.

Output :
echo https://www.google.com | katana -d 1

https://policies.google.com/terms?hl=en-IN&fg=1
https://www.google.com/url?sa=t&rct=j&source=webhp&url=https://policies.google.com/terms%3Fhl%3Den-IN%26fg%3D1&ved=0ahUKEwjK7qb7mPz5AhVfUGwGHbDMC3gQ8qwCCB0
https://www.google.com/preferences?hl=en-IN&fg=1

echo https://www.google.com | katana -d 1 -nqs
Desired Output :

https://policies.google.com/terms
https://www.google.com/url
https://www.google.com/preferences

Add login integrations + scripts for authenticated crawling

Please describe your feature request:

Describe the use case of this feature:

Investigate headless as a proxy

Please describe your feature request:

We need to investigate if it's possible to use chrome as a proxy for HTTP/HTTPS requests. At current time requests are performed with go client via go-rod hijacking.

Describe the use case of this feature:

HTTP requests would have native browser fingerprinting and full context

Errors in default run

Error information can be moved from default mode to verbose mode.

https://privacy.thewaltdisneycompany.com/app/themes/privacycenter/assets/dist/js/app-cfa6fbf0.min.js
https://privacy.thewaltdisneycompany.com/en/?s=katana&sentence=1
[ERR] Could not request seed URL: GET http://44.199.9.133/savings/ giving up after 2 attempts: Get "http://44.199.9.133/savings/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/membership/costs/ giving up after 2 attempts: Get "http://44.199.9.133/membership/costs/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/destinations/dvc-resorts/ giving up after 2 attempts: Get "http://44.199.9.133/destinations/dvc-resorts/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/explore-membership/ giving up after 2 attempts: Get "http://44.199.9.133/explore-membership/": no address found for host (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/destinations/explore-disney-destinations-and-resort-hotels/ giving up after 2 attempts: Get "http://44.199.9.133/destinations/explore-disney-destinations-and-resort-hotels/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/star-wars-galactic-starcruiser/ giving up after 2 attempts: Get "http://44.199.9.133/star-wars-galactic-starcruiser/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/discounts-perks-offers/ giving up after 2 attempts: Get "http://44.199.9.133/discounts-perks-offers/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/points-and-flexibility/ giving up after 2 attempts: Get "http://44.199.9.133/points-and-flexibility/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/membership-magic/ giving up after 2 attempts: Get "http://44.199.9.133/membership-magic/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Could not request seed URL (stopped after 10 redirects)

./katana -u https://www.hackerone.com -csd hackerone.com -is -d 5

https://www.hackerone.com/vulnerability-management/vulnerability-assessment-i-complete-guide
https://www.hackerone.com/vulnerability-management/vulnerability-assessment-tools-top-tools-what-they-do
https://www.hackerone.com/vulnerability-management/bug-bounty-vs-vdp-which-program-right-you
[ERR] Could not request seed URL: Get "/vulnerability-management/critical-introducing-severity-cvss": stopped after 10 redirects

Overhaul form-filling module to support more elements + custom fields

Support user specified values based on field names
Support more html elements
Refactor + improve

invalid / blank url being requested for crawl

katana version:

dev

Current Behavior:

blank url / non http/s protocol being requested.

Expected Behavior:

only crawl / request valid, http/s URL.

[ERR] Could not request seed URL: Get "javascript:window.print();": unsupported protocol scheme "javascript"
[ERR] Could not request seed URL: context deadline exceeded (Client.Timeout or context cancellation while reading body)

-H / -header not working as intended

-H is not working as intended :

root@bhramastra:/tmp/urldedupe# echo https://ylnhy1urfxmutnoat5qenl43hunkb9.oastify.com/ | katana -d 3 -o hk3 -c 100 -p 100 -rl 1500 -is -H "Cookie: ccc=ddd"

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1							 

		projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
[ERR] Could not request seed URL: GET https://ylnhy1urfxmutnoat5qenl43hunkb9.oastify.com/ giving up after 2 attempts: Get "https://ylnhy1urfxmutnoat5qenl43hunkb9.oastify.com/": net/http: invalid header field name "Cookie: ccc"

Create navigation module for headless crawler

Handle generic prompts (popups, alerts, etc)
Use go-rod to discover hooks and events and enqueue new discovered seed urls
Detect and handle navigation loops/edge cases with circuit breaker mechanisms: timeouts, errors

Create crawler benchmarking testbed

Tasks

Headless testbed
Non-headless testbed
Add crawl scoring
Cover variety of scenarios like login / register / logout / dynamic links / event listeners / random ids, etc.

Resources

https://github.com/google/security-crawl-maze
AJAX scenarios for crawler navigation testing

predefined fields to control output

Please describe your feature request:

CLI Option:

   -f, -field           field to display in output (fqdn,rdn,url,rurl,path,file,key,value,kv) (default url)

Example:

Field	Example
`url` (default)	`https://policies.google.com/terms/file.php?hl=en-IN&fg=1`
`rurl` (root url)	`https://policies.google.com`
`path`	`/terms/file.php?hl=en-IN&fg=1`
`file`	`file.php`
`key`	`hl`,`fg`
`value`	`en-IN`,`1`
`kv`	`hl=en-IN&fg=1`
`fqdn`	`policies.google.com`
`rdn`	`google.com`

Example run:

echo https://example.com | ./katana -f path -silent

/domains
/protocols
/numbers
/about
/go/rfc2606
/go/rfc6761
/http://www.icann.org/topics/idn/
/http://www.icann.org/
/domains/root/db/xn--kgbechtv.html
/domains/root/db/xn--hgbk6aj7f53bba.html
/domains/root/db/xn--0zwm56d.html
/domains/root/db/xn--g6w251d.html
/domains/root/db/xn--80akhbyknj4f.html
/domains/root/db/xn--11b5bs3a9aj6g.html
/domains/root/db/xn--jxalpdlp.html
/domains/root/db/xn--9t4b11yi5a.html
/domains/root/db/xn--deba0ad.html
/domains/root/db/xn--zckzah.html
/domains/root/db/xn--hlcj6aya9esc7a.html
/assignments/special-use-domain-names
/domains/root
/domains/root/db
/domains/root/files
/domains/root/manage
/domains/root/help
/domains/root/servers
/domains/int
/domains/int/manage
/domains/int/policy
/domains/arpa
/domains/idn-tables
/procedures/idn-repository.html
/dnssec
/dnssec/files
/dnssec/ceremonies
/dnssec/procedures
/dnssec/tcrs
/dnssec/archive
/domains/reserved
/abuse
/time-zones
/about/presentations
/reports
/performance
/reviews
/about/excellence
/contact
/_js/jquery.js
/_js/iana.js

This will be similar to uncover implementation of uncover - https://github.com/projectdiscovery/uncover#field-format

Describe the use case of this feature:

Control outptu as required for further processing / scanning / record

katana crashes when scanning large list of urls

cat ~/tmp/b1.txt | katana -d 3 -f udir -cs booking -is -c 100 -p 100 -rl 2000 -o test

Result :

xx
xx
xx
....
Killed

Create analyzer for scraping new navigation from headless page states

Anchor, Button, Embed, and Iframe for direct links.
Parse and fill HTML Forms as well optionally. (Login, Register, etc using these methods)
Scrape javascript / javascript files and collect links using regex.
Collect requests made by XHR/Javascript APIs as well.
Elements having event listeners can be navigated by querying the DOM or using JS hooks. (Decide on whether we want to use JS hooks or query the DOM)
Other relevant information can be decided in the future or depending upon demand.

default scope option update and no scope option

Please describe your feature request:

Adding host based default scope
Removing extension related filters, e, extensions-allow-list, extensions-deny-list (not being done because wouldn't work separately)
Removing csd, cosd option
Adding -no-scope option to disable default scope.

   -ns, -no-scope             disable host based default scope.

Describe the use case of this feature:

optimizing default behavior to be target specific.
optionally user can disable behavior when required.
removing options that are already covered under cs/cos option.

Create navigation and element deduplication mechanism for headless crawler

As title says, implement Navigation and Element structures and create their specific deduplication mechanisms

For deduplication, consider element attribute hashing, partial hashing, or similarity hashing. Do benchmarks and choose the best working method.

Use Heuristic Inference to improve form-fill capability

Please describe your feature request:

Automatic form filling without context is a hard task. After implementing a series of robust standard rules, it would be interesting to investigate further strategies to infer the form category from the page:

What is the topic of the page - semantical analysis (form to book an airplane, form to subscribe to a newsletter, login form, etc)
Can we use prior knowledge to classify similar forms? (forms from web frameworks, from taken from snippets on the web)
Form filling can be a multistep operation: we need an autonomous approach (eg. with a fitness function that rewards the most promising filling paths)

	func bodyMetaContentTagParser(resp navigationResponse, callback func(navigationRequest)) {
	resp.Reader.Find("meta[http-equiv='refresh']").Each(func(i int, item *goquery.Selection) {
	header, ok := item.Attr("content")
	if !ok {
	return
	}
	values := utils.ParseRefreshTag(header)
	if values == "" {
	return
	}
	callback(newNavigationRequestURL(values, resp.Resp.Request.URL.String(), "meta", "refresh", resp))
	})
	}

projectdiscovery / katana Goto Github PK

katana's People

Contributors

Stargazers

Watchers

Forkers

katana's Issues

Description

Project Version

Please describe your feature request:

Please describe your feature request:

Please describe your feature request:

Describe the use case of this feature:

katana version:

Current Behavior:

Expected Behavior:

Steps To Reproduce:

Anything else:

Action item:

Please describe your feature request:

Describe the use case of this feature:

Description

Please describe your feature request:

Describe the use case of this feature:

Description

katana version:

Current Behavior:

Expected Behavior:

Steps To Reproduce:

Anything else:

Please describe your feature request:

Describe the use case of this feature:

Please describe your feature request:

Please describe your feature request:

Describe the use case of this feature:

Please describe your feature request:

Describe the use case of this feature:

Please describe your feature request:

Describe the use case of this feature:

katana version:

Current Behavior:

Expected Behavior:

Please describe your feature request:

Describe the use case of this feature:

katana version:

Current Behavior:

Please describe your feature request:

Describe the use case of this feature:

Description

katana version:

Current Behavior:

Expected Behavior:

Steps To Reproduce:

katana version:

Current Behavior:

Expected Behavior:

Steps To Reproduce:

Please describe your feature request:

Describe the use case of this feature:

Please describe your feature request:

Describe the use case of this feature:

katana version:

Current Behavior:

Expected Behavior:

Tasks

Resources

Please describe your feature request:

Describe the use case of this feature:

Please describe your feature request:

Describe the use case of this feature:

Please describe your feature request:

Recommend Projects

Recommend Topics

Recommend Org