fcavallarin / htcap

htcap is a web application scanner able to crawl single page application (SPA) recursively by intercepting ajax calls and DOM changes.

License: GNU General Public License v2.0

Python 73.59% HTML 1.56% JavaScript 19.39% CSS 4.58% Dockerfile 0.89%

htcap's People

Contributors

faf0-addepar, fcavallarin, guilloome, iambrosie, segment-srl

htcap's Issues

aggressive mode does not fill input values

Hello. Thanks for the wonderful tool.
While reading the documentation I found the following: "The aggressive mode makes htcap also fill input values and post forms. This simulates a user that performs as many actions as possible on the page."
But when I test it against a login form, it does not fill the input values. For example:

python htcap.py crawl -v http://1.1.1.1:8080/ target.db
Initializing . . . done
Database target-5.db initialized, crawl started with 10 threads (^C to pause or change verbosity)
crawl result for: link GET http://1.1.1.1:8080/
new request found form POST http://1.1.1.1:8080/login name=&pwd=&code=
crawl result for: form POST http://1.1.1.1:8080/login name=&pwd=&code=

I would expect something like name=aaa&pwd=aaa&code=111.
Is anything wrong?

Crawler Authorization Header Issue

Hi htcap authors.

I have recently discovered your tool and started using it for pen testing a SPA application based on Angular and Spring Boot. The security of the application is based on tokens, which must be provided on every HTTP request to the REST API as a header (ex. Auth: Basic tokenvalue).

Unfortunately, I am not able to use the crawler correctly: it stops at the login page. I provided the required header using the -E parameter (e.g. python htcap.py crawl -E 'Auth=Basic tokenvalue' target dest) and it does not work (double-checked with Wireshark, which does not show the header being added to the requests).

I also tried using the credentials parameter (e.g. python htcap.py crawl -l -A 'user:pass' target dest), but it does not work either. When trying to log in to the application, the crawler uses random strings each time.

Is this a bug, or am I using the tool incorrectly? Could you please provide some information?
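
A quick way to double-check whether the extra header actually leaves htcap (independently of Wireshark) is to point the crawler at a tiny local server that just dumps the request headers it receives. This is a generic Python 3 sketch, not part of htcap; the port and the handler name are arbitrary:

# dump_headers.py - print the headers of every incoming GET request
from http.server import BaseHTTPRequestHandler, HTTPServer

class DumpHeaders(BaseHTTPRequestHandler):
    def do_GET(self):
        print(self.headers)  # shows whether "Auth: Basic ..." arrived
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>ok</body></html>")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), DumpHeaders).serve_forever()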

Migrate htcap to python 3

I wanted to integrate htcap into one of my Docker containers, and all the rest of my code is in Python 3.

Aside from my own usage, here are the advantages of migrating to Python 3 (a small compatibility sketch follows the list):

  • Unicode support out of the box
  • Better support of the language itself (no more updates to Python 2)
  • Fewer and fewer libraries are maintained for, or compatible with, Python 2
  • Better support for async tasks
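
As a rough illustration of the work involved, most of the stdlib imports used by crawler.py only need a compatibility shim like the one below (an assumption about how a gradual migration could start, not an official patch):

# py2/py3 compatibility shim for the stdlib modules crawler.py relies on
try:
    # Python 3 locations
    from urllib.parse import urlsplit, urljoin, unquote
    from urllib.request import urlopen
    from urllib.error import HTTPError
except ImportError:
    # Python 2 locations
    from urlparse import urlsplit, urljoin
    from urllib import unquote
    from urllib2 import urlopen, HTTPError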

/core/crawl/crawler.py script is not working properly

It should be like this:

# -*- coding: utf-8 -*-

"""
HTCAP - beta 1
Author: [email protected]

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
"""

from __future__ import unicode_literals
import sys
import os
import datetime
import time
import getopt
import json
import re
from urlparse import urlsplit, urljoin
from urllib import unquote
import urllib2
import threading
import subprocess
from random import choice
import string
import ssl
import signal

from core.lib.exception import *
from core.lib.cookie import Cookie
from core.lib.database import Database

from lib.shared import *
from lib.crawl_result import *
from core.lib.request import Request
from core.lib.http_get import HttpGet
from core.lib.shell import CommandExecutor

from crawler_thread import CrawlerThread
#from core.lib.shingleprint import ShinglePrint
from core.lib.texthash import TextHash
from core.lib.request_pattern import RequestPattern
from core.lib.utils import *
from core.constants import *
from lib.utils import *
from core.lib.progressbar import Progressbar

class Crawler:

def __init__(self, argv):

	self.base_dir = getrealdir(__file__) + os.sep

	self.crawl_start_time = int(time.time())
	self.crawl_end_time = None
	self.page_hashes = []
	self.request_patterns = []
	self.db_file = ""
	self.display_progress = True
	self.verbose = False
	self.defaults = {
		"useragent": 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3582.0 Safari/537.36',
		"num_threads": 10,
		"max_redirects": 10,
		"out_file_overwrite": False,
		"proxy": None,
		"http_auth": None,
		"use_urllib_onerror": True,
		"group_qs": False,
		"process_timeout": 300,
		"scope": CRAWLSCOPE_DOMAIN,
		"mode": CRAWLMODE_AGGRESSIVE,
		"max_depth": 100,
		"max_post_depth": 10,
		"override_timeout_functions": True,
		'crawl_forms': True, # only if mode == CRAWLMODE_AGGRESSIVE
		'deduplicate_pages': True,
		'headless_chrome': True,
		'extra_headers': False,
		'login_sequence': None,
		'simulate_real_events': True
	}


	self.main(argv)



def usage(self):
	infos = get_program_infos()
	print ("htcap crawler ver " + infos['version'] + "\n"
		   "usage: crawl [options] url outfile\n"
		   "hit ^C to pause the crawler or change verbosity\n"
		   "Options: \n"
		   "  -h               this help\n"
		   "  -w               overwrite output file\n"
		   "  -q               do not display progress informations\n"
		   "  -v               be verbose\n"
		   "  -m MODE          set crawl mode:\n"
		   "                      - "+CRAWLMODE_PASSIVE+": do not intract with the page\n"
		   "                      - "+CRAWLMODE_ACTIVE+": trigger events\n"
		   "                      - "+CRAWLMODE_AGGRESSIVE+": also fill input values and crawl forms (default)\n"
		   "  -s SCOPE         set crawl scope\n"
		   "                      - "+CRAWLSCOPE_DOMAIN+": limit crawling to current domain (default)\n"
		   "                      - "+CRAWLSCOPE_DIRECTORY+": limit crawling to current directory (and subdirecotries) \n"
		   "                      - "+CRAWLSCOPE_URL+": do not crawl, just analyze a single page\n"
		   "  -D               maximum crawl depth (default: " + str(Shared.options['max_depth']) + ")\n"
		   "  -P               maximum crawl depth for consecutive forms (default: " + str(Shared.options['max_post_depth']) + ")\n"
		   "  -F               even if in aggressive mode, do not crawl forms\n"
		   "  -H               save HTML generated by the page\n"
		   "  -d DOMAINS       comma separated list of allowed domains (ex *.target.com)\n"
		   "  -c COOKIES       cookies as json or name=value pairs separaded by semicolon\n"
		   "  -C COOKIE_FILE   path to file containing COOKIES \n"
		   "  -r REFERER       set initial referer\n"
		   "  -x EXCLUDED      comma separated list of urls to exclude (regex) - ie logout urls\n"
		   "  -p PROXY         proxy string protocol:host:port -  protocol can be 'http' or 'socks5'\n"
		   "  -n THREADS       number of parallel threads (default: " + str(self.defaults['num_threads']) + ")\n"
		   "  -A CREDENTIALS   username and password used for HTTP authentication separated by a colon\n"
		   "  -U USERAGENT     set user agent\n"
		   "  -t TIMEOUT       maximum seconds spent to analyze a page (default " + str(self.defaults['process_timeout']) + ")\n"
		   "  -S               skip initial checks\n"
		   "  -G               group query_string parameters with the same name ('[]' ending excluded)\n"
		   "  -N               don't normalize URL path (keep ../../)\n"
		   "  -R               maximum number of redirects to follow (default " + str(self.defaults['max_redirects']) + ")\n"
		   "  -I               ignore robots.txt\n"
		   "  -O               dont't override timeout functions (setTimeout, setInterval)\n"
		   "  -e               disable hEuristic page deduplication\n"
		   "  -l               do not run chrome in headless mode\n"
		   "  -E HEADER        set extra http headers (ex -E foo=bar -E bar=foo)\n"
		   "  -M               don't simulate real mouse/keyboard events\n"
		   "  -L SEQUENCE      set login sequence\n"
		   )


def generate_filename(self, name, out_file_overwrite):
	fname = generate_filename(name, None, out_file_overwrite)
	if out_file_overwrite:
		if os.path.exists(fname):
			os.remove(fname)

	return fname



def kill_threads(self, threads):
	Shared.th_condition.acquire()
	for th in threads:
		if th.isAlive():
			th.exit = True
			th.pause = False
			if th.probe_executor and th.probe_executor.cmd:
				th.probe_executor.cmd.terminate()
	Shared.th_condition.release()

	# start notify() chain
	Shared.th_condition.acquire()
	Shared.th_condition.notifyAll()
	Shared.th_condition.release()


def pause_threads(self, threads, pause):
	Shared.th_condition.acquire()
	for th in threads:
		if th.isAlive():
			th.pause = pause
	Shared.th_condition.release()


def init_db(self, dbname, report_name):
	infos = {
		"target": Shared.starturl,
		"scan_date": -1,
		"urls_scanned": -1,
		"scan_time": -1,
		'command_line': " ".join(sys.argv)
	}

	database = Database(dbname, report_name, infos)
	database.create()
	return database



def check_startrequest(self, request):

	h = HttpGet(request, Shared.options['process_timeout'], 2, Shared.options['useragent'], Shared.options['proxy'])
	try:
		h.get_requests()
	except NotHtmlException:
		print "\nError: Document is not html"
		sys.exit(1)
	except Exception as e:
		print "\nError: unable to open url: %s" % e
		sys.exit(1)



def get_requests_from_robots(self, request):
	purl = urlsplit(request.url)
	url = "%s://%s/robots.txt" % (purl.scheme, purl.netloc)

	getreq = Request(REQTYPE_LINK, "GET", url, extra_headers=Shared.options['extra_headers'])
	try:
		# request, timeout, retries=None, useragent=None, proxy=None):
		httpget = HttpGet(getreq, 10, 1, "Googlebot", Shared.options['proxy'])
		lines = httpget.get_file().split("\n")
	except urllib2.HTTPError:
		return []
	except:
		return []
		#raise

	requests = []
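	# each Allow/Disallow path below is turned into a candidate GET request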
	for line in lines:
		directive = ""
		url = None
		try:
			directive, url = re.sub("\#.*","",line).split(":",1)
		except:
			continue # ignore errors

		if re.match("(dis)?allow", directive.strip(), re.I):
			req = Request(REQTYPE_LINK, "GET", url.strip(), parent=request)
			requests.append(req)


	return adjust_requests(requests) if requests else []



def randstr(self, length):
	all_chars = string.digits + string.ascii_letters + string.punctuation
	random_string = ''.join(choice(all_chars) for _ in range(length))
	return random_string


def request_is_duplicated(self, page_hash):
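	# a page counts as a duplicate when TextHash.compare matches its hash
	# against the hash of any previously crawled page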
	for h in self.page_hashes:
		if TextHash.compare(page_hash, h):
			return True
	return False


def main_loop(self, threads, start_requests, database):
	pending = len(start_requests)
	crawled = 0
	pb = Progressbar(self.crawl_start_time, "pages processed")
	req_to_crawl = start_requests
	while True:
		try:

			if self.display_progress and not self.verbose:
				tot = (crawled + pending)
				pb.out(tot, crawled)

			if pending == 0:
				# is the check of running threads really needed?
				running_threads = [t for t in threads if t.status == THSTAT_RUNNING]
				if len(running_threads) == 0:
					if self.display_progress or self.verbose:
						print ""
					break

			if len(req_to_crawl) > 0:
				Shared.th_condition.acquire()
				Shared.requests.extend(req_to_crawl)
				Shared.th_condition.notifyAll()
				Shared.th_condition.release()

			req_to_crawl = []
			Shared.main_condition.acquire()
			Shared.main_condition.wait(1)
			if len(Shared.crawl_results) > 0:
				database.connect()
				database.begin()
				for result in Shared.crawl_results:
					crawled += 1
					pending -= 1
					if self.verbose:
						print "crawl result for: %s " % result.request
						if len(result.request.user_output) > 0:
							print "  user: %s" % json.dumps(result.request.user_output)
						if result.errors:
							print "* crawler errors: %s" % ", ".join(result.errors)

					database.save_crawl_result(result, True)

					if Shared.options['deduplicate_pages']:
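						# when the page hash matches an already-seen page, keep only the
						# found requests whose pattern has not been recorded yet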
						if self.request_is_duplicated(result.page_hash):
							filtered_requests = []
							for r in result.found_requests:
								if RequestPattern(r).pattern not in self.request_patterns:
									filtered_requests.append(r)
							result.found_requests = filtered_requests
							if self.verbose:
								print " * marked as duplicated ... requests filtered"

						self.page_hashes.append(result.page_hash)
						for r in result.found_requests:
							self.request_patterns.append(RequestPattern(r).pattern)

					for req in result.found_requests:

						database.save_request(req)

						if self.verbose and req not in Shared.requests and req not in req_to_crawl:
								print "  new request found %s" % req

						if request_is_crawlable(req) and req not in Shared.requests and req not in req_to_crawl:

							if request_depth(req) > Shared.options['max_depth'] or request_post_depth(req) > Shared.options['max_post_depth']:
								if self.verbose:
									print "  * cannot crawl: %s : crawl depth limit reached" % req
								result = CrawlResult(req, errors=[ERROR_CRAWLDEPTH])
								database.save_crawl_result(result, False)
								continue

							if req.redirects > Shared.options['max_redirects']:
								if self.verbose:
									print "  * cannot crawl: %s : too many redirects" % req
								result = CrawlResult(req, errors=[ERROR_MAXREDIRECTS])
								database.save_crawl_result(result, False)
								continue

							pending += 1
							req_to_crawl.append(req)

				Shared.crawl_results = []
				database.commit()
				database.close()
			Shared.main_condition.release()

		except KeyboardInterrupt:
			try:
				Shared.main_condition.release()
				Shared.th_condition.release()
			except:
				pass
			self.pause_threads(threads, True)
			if not self.get_runtime_command():
				print "Exiting . . ."
				return
			print "Crawler is running"
			self.pause_threads(threads, False)




def get_runtime_command(self):
	while True:
		print (
			"\nCrawler is paused.\n"
			"   r    resume\n"
			"   v    verbose mode\n"
			"   p    show progress bar\n"
			"   q    quiet mode\n"
			"Hit ctrl-c again to exit\n"
		)
		try:
			ui = raw_input("> ").strip()
		except KeyboardInterrupt:
			print ""
			return False

		if ui == "r":
			break
		elif ui == "v":
			self.verbose = True
			break
		elif ui == "p":
			self.display_progress = True
			self.verbose = False
			break
		elif ui == "q":
			self.verbose = False
			self.display_progress = False
			break
		print " "

	return True

def init_crawl(self, start_req, check_starturl, get_robots_txt):
	start_requests = [start_req]
	try:
		if check_starturl:
			self.check_startrequest(start_req)
			stdoutw(". ")

		if get_robots_txt:
			rrequests = self.get_requests_from_robots(start_req)
			stdoutw(". ")
			for req in rrequests:
				if request_is_crawlable(req) and not req in start_requests:
					start_requests.append(req)
	except KeyboardInterrupt:
		print "\nAborted"
		sys.exit(0)

	return start_requests


def main(self, argv):
	Shared.options = self.defaults
	Shared.th_condition = threading.Condition()
	Shared.main_condition = threading.Condition()

	deps_errors = check_dependences(self.base_dir)
	if len(deps_errors) > 0:
		print "Dependences errors: "
		for err in deps_errors:
			print "  %s" % err
		sys.exit(1)

	start_cookies = []
	start_referer = None

	probe_options = ["-R", self.randstr(20)]
	threads = []
	num_threads = self.defaults['num_threads']

	out_file = ""
	out_file_overwrite = self.defaults['out_file_overwrite']
	cookie_string = None
	initial_checks = True
	http_auth = None
	get_robots_txt = True
	save_html = False

	try:
		opts, args = getopt.getopt(argv[2:], 'hc:t:jn:x:A:p:d:BGR:U:wD:s:m:C:qr:SIHFP:OvelE:L:M')
	except getopt.GetoptError as err:
		print str(err)
		sys.exit(1)


	if len(argv) < 2:
		self.usage()
		sys.exit(1)


	for o, v in opts:
		if o == '-h':
			self.usage()
			sys.exit(0)
		elif o == '-c':
			cookie_string = v
		elif o == '-C':
			try:
				with open(v) as cf:
					cookie_string = cf.read()
			except Exception as e:
				print "error reading cookie file"
				sys.exit(1)
		elif o == '-r':
			start_referer = v
		elif o == '-n':
			num_threads = int(v)
		elif o == '-t':
			Shared.options['process_timeout'] = int(v)
		elif o == '-q':
			self.display_progress = False
		elif o == '-A':
			http_auth = v
		elif o == '-p':
			try:
				Shared.options['proxy'] = parse_proxy_string(v)
			except Exception as e:
				print e
				sys.exit(1)
		elif o == '-d':
			for ad in v.split(","):
				# convert *.domain.com into a regex matching domain.com and any of its subdomains
				pattern = re.escape(ad).replace("\\*\\.","((.*\\.)|)")
				Shared.allowed_domains.add(pattern)
		elif o == '-x':
			for eu in v.split(","):
				try:
					re.match(eu, "")
				except:
					print "* ERROR: regex failed: %s" % eu
					sys.exit(1)
				Shared.excluded_urls.add(eu)
		elif o == "-G":
			Shared.options['group_qs'] = True
		elif o == "-w":
			out_file_overwrite = True
		elif o == "-R":
			Shared.options['max_redirects'] = int(v)
		elif o == "-U":
			Shared.options['useragent'] = v
		elif o == "-s":
			if not v in (CRAWLSCOPE_DOMAIN, CRAWLSCOPE_DIRECTORY, CRAWLSCOPE_URL):
				self.usage()
				print "* ERROR: wrong scope set '%s'" % v
				sys.exit(1)
			Shared.options['scope'] = v
		elif o == "-m":
			if not v in (CRAWLMODE_PASSIVE, CRAWLMODE_ACTIVE, CRAWLMODE_AGGRESSIVE):
				self.usage()
				print "* ERROR: wrong mode set '%s'" % v
				sys.exit(1)
			Shared.options['mode'] = v
		elif o == "-S":
			initial_checks = False
		elif o == "-I":
			get_robots_txt = False
		elif o == "-H":
			save_html = True
		elif o == "-D":
			Shared.options['max_depth'] = int(v)
		elif o == "-P":
			Shared.options['max_post_depth'] = int(v)
		elif o == "-O":
			Shared.options['override_timeout_functions'] = False
		elif o == "-F":
			Shared.options['crawl_forms'] = False
		elif o == "-v":
			self.verbose = True
		elif o == "-e":
			Shared.options['deduplicate_pages'] = False
		elif o == "-l":
			Shared.options['headless_chrome'] = False
		elif o == "-M":
			Shared.options['simulate_real_events'] = False
		elif o == "-E":
			if not Shared.options['extra_headers']:
				Shared.options['extra_headers'] = {}
			(hn, hv) = v.split("=", 1)
			Shared.options['extra_headers'][hn] = hv
		elif o == "-L":
			try:
				with open(v) as cf:
					Shared.options['login_sequence'] = json.loads(cf.read())
					Shared.options['login_sequence']["__file__"] = os.path.abspath(v)
			except ValueError as e:
				print "* ERROR: decoding login sequence"
				sys.exit(1)
			except Exception as e:
				print "* ERROR: login sequence file not found"
				sys.exit(1)


	probe_cmd = get_node_cmd()
	if not probe_cmd: # maybe useless
		print "Error: unable to find node executable"
		sys.exit(1)


	if Shared.options['scope'] != CRAWLSCOPE_DOMAIN and len(Shared.allowed_domains) > 0:
		print "* Warinig: option -d is valid only if scope is %s" % CRAWLSCOPE_DOMAIN

	if cookie_string:
		try:
			start_cookies = parse_cookie_string(cookie_string)
		except Exception as e:
			print "error decoding cookie string"
			sys.exit(1)

	if Shared.options['mode'] != CRAWLMODE_AGGRESSIVE:
		probe_options.append("-f") # dont fill values
	if Shared.options['mode'] == CRAWLMODE_PASSIVE:
		probe_options.append("-t") # dont trigger events

	if Shared.options['proxy']:
		probe_options.extend(["-y", "%s:%s:%s" % (Shared.options['proxy']['proto'], Shared.options['proxy']['host'], Shared.options['proxy']['port'])])
	if not Shared.options['headless_chrome']:
		probe_options.append("-l")
	probe_cmd.append(os.path.join(self.base_dir, 'probe', 'analyze.js'))


	if len(Shared.excluded_urls) > 0:
		probe_options.extend(("-X", ",".join(Shared.excluded_urls)))

	if save_html:
		probe_options.append("-H")


	probe_options.extend(("-x", str(Shared.options['process_timeout'])))
	probe_options.extend(("-A", Shared.options['useragent']))

	if not Shared.options['override_timeout_functions']:
		probe_options.append("-O")

	if Shared.options['extra_headers']:
		probe_options.extend(["-E", json.dumps(Shared.options['extra_headers'])])

	if not Shared.options['simulate_real_events']:
		probe_options.append("-M")

	Shared.probe_cmd = probe_cmd + probe_options


	Shared.starturl = normalize_url(argv[0])
	out_file = argv[1]

	purl = urlsplit(Shared.starturl)
	Shared.allowed_domains.add(purl.hostname)



	if Shared.options['login_sequence'] and Shared.options['login_sequence']['type'] == LOGSEQTYPE_SHARED:
		login_req = Request(REQTYPE_LINK, "GET", Shared.options['login_sequence']['url'],
			set_cookie=Shared.start_cookies,
			http_auth=http_auth,
			referer=start_referer,
			extra_headers=Shared.options['extra_headers']
		)
		stdoutw("Logging in . . . ")
		try:
			pe = ProbeExecutor(login_req, Shared.probe_cmd + ["-z"], login_sequence=Shared.options['login_sequence'])
			probe = pe.execute()
			if not probe:
				print "\n* ERROR: login sequence failed to execute probe"
				sys.exit(1)
			if probe.status == "ok":
				for c in probe.cookies:
					if not Shared.options['login_sequence']['cookies'] or c.name in Shared.options['login_sequence']['cookies']:
						Shared.start_cookies.append(c)
			else:
				print "\n* ERROR: login sequence failed:\n   %s" % probe.errmessage
				sys.exit(1)
		except KeyboardInterrupt:
			pe.terminate()
			print "\nAborted"
			sys.exit(0)
		print "done"


	for sc in start_cookies:
		Shared.start_cookies.append(Cookie(sc, Shared.starturl))


	start_req = Request(REQTYPE_LINK, "GET", Shared.starturl,
		set_cookie=Shared.start_cookies,
		http_auth=http_auth,
		referer=start_referer,
		extra_headers=Shared.options['extra_headers']
	)

	if not hasattr(ssl, "SSLContext"):
		print "* WARNING: SSLContext is not supported with this version of python, consider to upgrade to >= 2.7.9 in case of SSL errors"

	stdoutw("Initializing . ")

	start_requests = self.init_crawl(start_req, initial_checks, get_robots_txt)

	database = None
	self.db_file = self.generate_filename(out_file, out_file_overwrite)
	try:
		database = self.init_db(self.db_file, out_file)
	except Exception as e:
		print str(e)
		sys.exit(1)

	database.save_crawl_info(
		htcap_version = get_program_infos()['version'],
		target = Shared.starturl,
		start_date = self.crawl_start_time,
		commandline = cmd_to_str(argv),
		user_agent = Shared.options['useragent'],
		proxy = json.dumps(Shared.options['proxy']),
		extra_headers = json.dumps(Shared.options['extra_headers']),
		cookies = json.dumps([x.get_dict() for x in Shared.start_cookies])
	)

	database.connect()
	database.begin()
	for req in start_requests:
		database.save_request(req)
	database.commit()
	database.close()

	print "done"
	print "Database %s initialized, crawl started with %d threads (^C to pause or change verbosity)" % (self.db_file, num_threads)

	for n in range(0, num_threads):
		thread = CrawlerThread()
		threads.append(thread)
		thread.start()


	self.main_loop(threads, start_requests, database)

	self.kill_threads(threads)

	self.crawl_end_time = int(time.time())

	print "Crawl finished, %d pages analyzed in %d minutes" % (Shared.requests_index, (self.crawl_end_time - self.crawl_start_time) / 60)

	database.save_crawl_info(end_date=self.crawl_end_time)

Sites using `window.location` to navigate are not crawled properly

Sites relying on window.location for navigation are not crawled properly.

A crawl on a page like this one will not discover the whole site:

<html>
	<head>
	<script>
		function go1(){
			window.location = "../1.php";
		}
		function go2(){
			window.location = "../2.php";
		}
		function go3(){
			window.location = "../3.php";
		}
	</script>
	</head>
	<body>
		<span onclick="go1()">click here </span> 
		<a onclick="go2()">click here </a> 
		<td onclick="go3()">click here </td> 
	</body>
</html>

cr.db_fname is not defined

When I try to

 python htcap/htcap.py crawl http://testesseg.c3sl.ufpr.br:3000/ badstore.bd \; scan wapiti \; util report badstore.db badstore.html

The output gives me an error:

Scan finished
Traceback (most recent call last):
  File "htcap/htcap.py", line 83, in <module>
    dbfile = cr.db_fname if cr else None
AttributeError: Crawler instance has no attribute 'db_fname'

It is just a small mix-up between the variable names used in htcap.py and in the Crawler class.

htcap.py

htcap/htcap.py

Line 83 in 19d3e2a

dbfile = cr.db_fname if cr else None

Crawler class:

self.db_file = ""

No deduplication for POST requests

Hi and thanks for the awesome tool you're working on here.

I ran a crawl with htcap and initiated a scan with the collected requests. Otherwise it worked really well, but POST requests with the same parameter values were scanned multiple times. Looking at the code, there only seems to be deduplication for GET requests, whereas all POST requests are included in the crawl results even if they match previously found POSTs.

You could use the same method you currently have for GET deduplication, i.e. collect the parameters, null out their values and sort them. Different body formats may need to be parsed here (at least form, JSON and XML), which requires some additional work; a rough sketch of the idea follows.
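
Something along these lines could work for the form-encoded and JSON cases (a minimal Python 3 sketch, with illustrative names; XML would need its own branch):

import json
from urllib.parse import parse_qsl, urlencode

def post_body_pattern(body, content_type=""):
    """Null out parameter values and sort the names, so two POSTs that differ
    only in values map to the same pattern and can be deduplicated."""
    if "json" in content_type.lower():
        try:
            data = json.loads(body)
            if isinstance(data, dict):
                return json.dumps({k: None for k in sorted(data)}, sort_keys=True)
        except ValueError:
            pass  # fall through and treat the body as form-encoded
    pairs = parse_qsl(body, keep_blank_values=True)
    return urlencode(sorted((name, "") for name, _ in pairs))

# post_body_pattern("name=bob&pwd=x&code=1") == post_body_pattern("name=&pwd=&code=2")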

Again, amazing work on the crawler so far!

Update to Python 3

Python 2.x will be EOL by the end of this year. Any plans to move this project to 3.x?

Make it possible to act on HTML elements without an ID

I was trying to put together a sequence that involves a clickToNavigate action, but the button I am trying to click does not have an ID that I could pick. I do not know if there is any way around this given the current specification. If there is, I would be thankful if you kindly shared what it is; otherwise, would you consider adding e.g. XPath support to define elements?

Thank you

Single page apps are not crawled

Issuing the following command (htcap installed either locally or in a Docker container):

The good ✔️ :

# htcap crawl example.com target.db
Initializing . . . done
Database target-2.db initialized, crawl started with 10 threads (^C to pause or change verbosity)
[=================================]   1 of 1 pages processed in 0 minutes
Crawl finished, 1 pages analyzed in 0 minutes 

The bad ❌ :

# htcap crawl https://elm-spa-example.netlify.app/ target.db
Initializing . . . done
Database target-3.db initialized, crawl started with 10 threads (^C to pause or change verbosity)
[                                 ]   0 of 1 pages processed in 10 minutes

The expected result would be for the crawl to proceed, but nothing happens.
When opening Chromium in non-headless mode, one can see that no pages are opened; htcap stays on the about:local page.
Am I missing something, or is it a bug?

ImportError: cannot import name 'Callable' from 'collections'

python3 htcap.py
Traceback (most recent call last):
File "/opt/tools/htcap/htcap.py", line 22, in <module>
from core.crawl.crawler import Crawler
File "/opt/tools/htcap/core/crawl/crawler.py", line 39, in <module>
from core.lib.http_get import HttpGet
File "/opt/tools/htcap/core/lib/http_get.py", line 29, in <module>
import core.lib.thirdparty.pysocks.socks as socks
File "/opt/tools/htcap/core/lib/thirdparty/pysocks/socks.py", line 62, in <module>
from collections import Callable
ImportError: cannot import name 'Callable' from 'collections' (/usr/lib/python3.10/collections

python3 -V
Python 3.10.4
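
The usual fix (assuming the bundled pysocks cannot simply be upgraded) is to make the import work on both old and new interpreters, since Callable moved to collections.abc in Python 3.3 and was removed from collections in 3.10. A sketch of the compatible import, not an official patch:

try:
    from collections.abc import Callable   # Python 3.3+
except ImportError:
    from collections import Callable       # Python 2 / very old Python 3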

error json in crawl

I am a bug bounty hunter and I am testing htcap, but I keep getting a recurrent error.
I am using an up-to-date PhantomJS:

~$ phantomjs --version
2.1.1

Please describe the solution.

/opt/htcap/htcap$ sudo python ./htcap.py crawl -m aggressive -H https://www.xxxxxx.fr/fr/index.html cicfr.db
Initializing . . . done
Database cicfr.db initialized, crawl started with 10 threads
[ ] 0 of 31 pages processed in 0 minutes
Exception in thread Thread-9:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/opt/htcap/htcap/core/crawl/crawler_thread.py", line 62, in run
self.crawl()
File "/opt/htcap/htcap/core/crawl/crawler_thread.py", line 215, in crawl
probe = self.send_probe(request, errors)
File "/opt/htcap/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
probeArray = self.load_probe_json(jsn)
File "/opt/htcap/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
return json.loads(jsn)
File "/usr/lib/python2.7/json/init.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Exception in thread Thread-2:

Send raw arguments to Chrome

Hello.
What about the possibility of sending arbitrary arguments to the Chrome browser?

First of all I want to ignore any SSL errors:
chrome --ignore-certificate-errors --ignore-ssl-errors ...

Also, very often we are faced with old/hidden sites (without valid DNS):
chrome --host-resolver-rules="MAP old.company.int 1.2.3.4"
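
For illustration only, a pass-through could look like the sketch below. The function and variable names are assumptions, as is the way htcap would wire this in, but the two switches themselves are real Chrome flags:

import subprocess

EXTRA_CHROME_ARGS = [
    "--ignore-certificate-errors",
    '--host-resolver-rules=MAP old.company.int 1.2.3.4',
]

def launch_chrome(chrome_path, url, headless=True, extra_args=None):
    # hypothetical launcher: append user-supplied switches before the target URL
    cmd = [chrome_path]
    if headless:
        cmd.append("--headless")
    cmd.extend(extra_args or [])
    cmd.append(url)
    return subprocess.Popen(cmd)

# launch_chrome("/usr/bin/chromium", "https://old.company.int/", extra_args=EXTRA_CHROME_ARGS)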

Improper Form Submission While Crawling SPAs

Hello,

Htcap is not submitting forms correctly while crawling SPAs.

For example, when I crawled the website https://brokencrystals.com with htcap, it didn't send the requests properly while crawling.

  • The actual login request looks like the one below, where the form is submitted to the /api/auth/login endpoint with a POST request and a JSON body.

    (screenshot: actual_login)

  • On the other hand, htcap sent a GET request, with the data in the URL, to the /userlogin endpoint (which is a frontend page that does not handle any backend operations).

    (screenshot: htcap_login)

I have seen this same behavior multiple times while crawling other SPAs as well.

Make it possible to resume a crawl with an existing database

The current behavior is:

  • if there is no .db file, create a new one.
  • if one already exists, create a new one with a renamed filename.
  • if option -w is given, overwrite the existing one.

Benefits

  • start and stop a crawl without re-crawling the whole site
  • having a single db for multiple assessments

sqlmap not working

Hi, your answers solved my problems in the older versions, but with the new one I am having a problem with sqlmap. I set the path to the directory where I have sqlmap, but it is not working. Can you please help? Here is what I am doing:

def get_settings(self):
	return dict(
		request_types = "xhr,link,form,jsonp,redirect,fetch",
		num_threads = 5,
		process_timeout = 300,
		scanner_exe = "/hacktools/webtools/htcap/sqlmap/sqlmap.py"
	)

But it is not working; this is all I get:

Sqlmap executable not found in $PATH

Sporadic conflict with tablet driver

Hello there and thanks for your work! This error is exotic and sporadic and seems to come down to PhantomJS, but I think it is useful to leave a report here, as I've found no similar reports for PhantomJS alone.

It's Windows 10.

On the (usually second) htcap.py crawl command, a "There is a problem with your tablet driver" Windows error dialog pops up from the PhantomJS process.

I've done some simple debugging of htcap, and the bug shows up where PhantomJS works on the commands received from htcap.py crawl. Also, any subsequent invocation of PhantomJS (including phantomjs.exe --version) brings up the same tablet driver error. Until the tablet drivers are killed, both htcap crawl and PhantomJS hang and don't work properly.

This tablet driver is the software that drives Wacom tablets. Probably PhantomJS tries to interact with that input device while being used by htcap, and that brings on the error. I'll try to debug things by myself and get back with a more definite report.

Can't get all form parameters

Hello,

When you try to crawl this site, "http://testaspnet.vulnweb.com/login.aspx", you see that there is a simple login form. After examining the crawl result I realized that a form parameter was missing.

The command:

python2 tools/htcap/htcap.py crawl -s url http://testaspnet.vulnweb.com/login.aspx test.db

The Htcap crawler result:

__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTIyMzk2OTgxMQ9kFgICAQ9kFgICAQ9kFgQCAQ8WBB4EaHJlZgUKbG9naW4uYXNweB4JaW5uZXJodG1sBQVsb2dpbmQCAw8WBB8AZB4HVmlzaWJsZWhkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBQ9jYlBlcnNpc3RDb29raWXvu6wIgkhiLuMALZWDxQWHeEyxzQ%3D%3D&__VIEWSTATEGENERATOR=C2EE9ABB&__EVENTVALIDATION=%2FwEWWwKJvpbuCALStq24BwK3jsrkBALtuvfLDQKC3IeGDAK9zvyMDAKJ1YalCAKF%2Bb%2FgBQKF%2Bb%2FgBQKF%2BdOEDQKF%2BdOEDQKF%2BYfsCwKF%2BYfsCwKF%2BbuDAwKF%2BbuDAwKuk%2BfECQKuk%2BfECQKuk5ubAQKuk5ubAQKuk4%2B%2BCgKuk4%2B%2BCgKuk6PVAwKuk6PVAwKuk9fpDAKuk9fpDAKuk8uMBAKuk8uMBAKuk%2F%2BjDQKuk%2F%2BjDQKuk5PGBgKuk5PGBgKuk8evAwKuk8evAwKuk%2FvCDAKuk%2FvCDAKDusXrDwKDusXrDwKDuvmOBwKDuvmOBwKDuu0lAoO67SUCg7qB%2BAkCg7qB%2BAkCg7q1nwECg7q1nwECg7qpsgoCg7qpsgoCg7rd1gMCg7rd1gMCg7rx7QwCg7rx7QwCg7ql1QkCg7ql1QkCg7rZ6QICg7rZ6QICxtmYngcCxtmYngcCxtmMNQLG2Yw1AsbZoMgJAsbZoMgJAsbZ1OwCAsbZ1OwCAsbZyIMKAsbZyIMKAsbZ%2FKYDAsbZ%2FKYDAsbZkP0MAsbZkP0MAsbZhJAEAsbZhJAEAsbZ%2BPkCAsbZ%2BPkCAsbZ7JwKAsbZ7JwKAtvg%2FoUNAtvg%2FoUNAtvgktgGAtvgktgGAtvghv8PAtvghv8PAtvgupIHAtvgupIHAtvgrikC2%2BCuKQLb4MLNCQLb4MLNCQLb4PbgAgLb4PbgAgLb4OqHCgLb4OqHCkIzkaRk2Lc5p1%2BA0FodgqNefSMy&tbUsername=UTgXosHq&tbPassword=tGS.VU634.!&cbPersistCookie=on

The normal crawling result:

__EVENTARGUMENT=1&__EVENTTARGET=1&__EVENTVALIDATION=/wEWWwKJvpbuCALStq24BwK3jsrkBALtuvfLDQKC3IeGDAK9zvyMDAKJ1YalCAKF%2Bb/gBQKF%2Bb/gBQKF%2BdOEDQKF%2BdOEDQKF%2BYfsCwKF%2BYfsCwKF%2BbuDAwKF%2BbuDAwKuk%2BfECQKuk%2BfECQKuk5ubAQKuk5ubAQKuk4%2B%2BCgKuk4%2B%2BCgKuk6PVAwKuk6PVAwKuk9fpDAKuk9fpDAKuk8uMBAKuk8uMBAKuk/%2BjDQKuk/%2BjDQKuk5PGBgKuk5PGBgKuk8evAwKuk8evAwKuk/vCDAKuk/vCDAKDusXrDwKDusXrDwKDuvmOBwKDuvmOBwKDuu0lAoO67SUCg7qB%2BAkCg7qB%2BAkCg7q1nwECg7q1nwECg7qpsgoCg7qpsgoCg7rd1gMCg7rd1gMCg7rx7QwCg7rx7QwCg7ql1QkCg7ql1QkCg7rZ6QICg7rZ6QICxtmYngcCxtmYngcCxtmMNQLG2Yw1AsbZoMgJAsbZoMgJAsbZ1OwCAsbZ1OwCAsbZyIMKAsbZyIMKAsbZ/KYDAsbZ/KYDAsbZkP0MAsbZkP0MAsbZhJAEAsbZhJAEAsbZ%2BPkCAsbZ%2BPkCAsbZ7JwKAsbZ7JwKAtvg/oUNAtvg/oUNAtvgktgGAtvgktgGAtvghv8PAtvghv8PAtvgupIHAtvgupIHAtvgrikC2%2BCuKQLb4MLNCQLb4MLNCQLb4PbgAgLb4PbgAgLb4OqHCgLb4OqHCkIzkaRk2Lc5p1%2BA0FodgqNefSMy&__VIEWSTATE=/wEPDwUKLTIyMzk2OTgxMQ9kFgICAQ9kFgICAQ9kFgQCAQ8WBB4EaHJlZgUKbG9naW4uYXNweB4JaW5uZXJodG1sBQVsb2dpbmQCAw8WBB8AZB4HVmlzaWJsZWhkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBQ9jYlBlcnNpc3RDb29raWXvu6wIgkhiLuMALZWDxQWHeEyxzQ==&__VIEWSTATEGENERATOR=C2EE9ABB&btnLogin=Login&cbPersistCookie=on&tbPassword=test&tbUsername=test

As you can see, the "btnLogin" parameter is missing from the htcap crawling result, and the missing parameter causes a problem.

Is there any workaround, or am I missing a command line parameter?

Greetings,

[Crawler] Wrong URL retrieval on a page with a <base> tag

When crawling a page with a <base href="…"> set in the header, the crawler resolves relative paths against the current path instead of the one provided in the <base> tag.

To reproduce
Crawl the page with the html:

<!DOCTYPE html>
<html>
<head>
    <base href="http://somewhere.else/someWeirdPath/" target="_self">
</head>
<body>
<a href="1.html">page 1</a>
</body>
</html>

Result
Htcap found http://mycurrent.domain/1.html but should have found http://somewhere.else/someWeirdPath/1.html
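
The fix boils down to resolving every relative link against the <base href> when one is present, instead of against the page URL. A minimal sketch (illustrative names, works on both Python 2 and 3):

try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

def resolve_link(page_url, base_href, link):
    # the <base href> (if any) replaces the page URL as the resolution base
    return urljoin(base_href or page_url, link)

# resolve_link("http://mycurrent.domain/index.html",
#              "http://somewhere.else/someWeirdPath/", "1.html")
# -> "http://somewhere.else/someWeirdPath/1.html"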

Only errors are displayed in the report

Only the URLs with errors are displayed in the HTML report; no other URLs are visible. I tried using https://htcap.org/scanme/ but got the same output.

(screenshot: Errors3)

probe_killed
probe_failure
HTTP Error 400: Bad Request

Command:
./htcap.py crawl https://htcap.org htcap.db -v
Initializing . . . done
Database htcap-2.db initialized, crawl started with 10 threads (^C to pause or change verbosity)
[================== ] 5 of 9 pages processed in 0 minutes^C
Crawler is paused.
r resume
v verbose mode
p show progress bar
q quiet mode
Hit ctrl-c again to exit

v
Crawler is running
crawl result for: redirect GET https://htcap.org/scanme/ng/

Error: unable to open url: <urlopen error no host given>

I'm trying to use htcap with Burp Suite. However, that always fails. Any hints?

root@kali:~/offsec/htcap# python htcap.py crawl -p http://localhost:8080 heise.de test.db
Initializing . 
Error: unable to open url: <urlopen error no host given>
root@kali:~/offsec/htcap# 

Without the -p flag it works.

Maybe just some Warning!

Hello!

First of all, amazing script! Lately I got more familiar with it and now use it in almost every security assessment. It gives me a very clear insight into the web application and makes the job easier. I have some ideas and functions that could maybe be added; when they are ready I will let you know!

The second thing I noticed is that on a large-scale crawl I am getting this error:

Database karta.db initialized, crawl started with 10 threads
[==== ] 966 of 7928 pages processed in 67 minutes
Exception in thread Thread-9:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/share/htcap/core/crawl/crawler_thread.py", line 62, in run
self.crawl()
File "/usr/share/htcap/core/crawl/crawler_thread.py", line 212, in crawl
probe = self.send_probe(request, errors)
File "/usr/share/htcap/core/crawl/crawler_thread.py", line 161, in send_probe
probeArray = self.load_probe_json(jsn)
File "/usr/share/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
return json.loads(jsn)
File "/usr/lib/python2.7/json/init.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 4 column 1 - line 4 column 249 (char 58 - 306)
Maybe it is just a warning specific to my system; I am using Kali Linux, but it is highly modified and built up for me, so maybe that is the reason.

It seems there is no impact on the workflow: it continues crawling, and afterwards, when working on the database, there is no error.

Anyway, good script and good luck developing it!

bug

Exception in thread Thread-1:
Traceback (most recent call last):
File "E:\Python2\lib\threading.py", line 810, in _bootstrap_inner
self.run()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 64, in run
self.crawl()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 147, in crawl
probe = self.send_probe(request, errors)
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 105, in send_probe
process_timeout=Shared.options['process_timeout']
File "E:\exploit\spider\htcap\core\crawl\lib\utils.py", line 210, in execute
probeArray = self.load_probe_json(jsn)
File "E:\exploit\spider\htcap\core\crawl\lib\utils.py", line 126, in load_probe_json
return json.loads(jsn)
File "E:\Python2\lib\json_init
.py", line 338, in loads
return _default_decoder.decode(s)
File "E:\Python2\lib\json\decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 55 column 1 - line 56 column 338 (char 15049 - 17940)

Exception in thread Thread-3:
Traceback (most recent call last):
File "E:\Python2\lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 64, in run
self.crawl()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 147, in crawl
probe = self.send_probe(request, errors)
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 105, in send_probe
process_timeout=Shared.options['process_timeout']
File "E:\exploit\spider\htcap\core\crawl\lib\utils.py", line 198, in execute
os.unlink(self.out_file)
WindowsError: [Error 32] : u'c:\users\admin\appdata\local\temp\htcap_output-4b54a465-e420-4c58-88b9-702e80ba8f20.json'

Traceback (most recent call last):
File "E:\Python2\lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 64, in run
self.crawl()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 147, in crawl
probe = self.send_probe(request, errors)
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 105, in send_probe
process_timeout=Shared.options['process_timeout']
File "E:\exploit\spider\htcap\core\crawl\lib\utils.py", line 210, in execute
probeArray = self.load_probe_json(jsn)
File "E:\exploit\spider\htcap\core\crawl\lib\utils.py", line 128, in load_probe_json
print "-- JSON DECODE ERROR %s" % jsn
IOError: [Errno 0] Error

arachni or sqlmap wont work

In /core/scan/scanner/arachni.py (and the same for sqlmap.py),

I changed the scanner exe from /usr/share/arachni/bin/arachni
to /home/arachni/bin/arachni, because that's where I have the Arachni framework, but it doesn't work.
It always worked for me in the previous versions (https://github.com/Vulnerability-scanner/htcap),
but in this version I only get "arachni executable not found". Can you please help?

execution error

I am getting this error when run crawl command:
Exception in thread Thread-31:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in *bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, _self.__kwargs)
File "/opt/htcap/core/lib/shell.py", line 44, in executor
self.process = subprocess.Popen(self.cmd,stderr=subprocess.PIPE, stdout=subprocess.PIPE, bufsize=0, close_fds=sys.platform != "win32")
File "/usr/lib/python2.7/subprocess.py", line 710, in __init

errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 8] Exec format error

Exception in thread Thread-30:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in *bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, _self.__kwargs)
File "/opt/htcap/core/lib/shell.py", line 44, in executor
self.process = subprocess.Popen(self.cmd,stderr=subprocess.PIPE, stdout=subprocess.PIPE, bufsize=0, close_fds=sys.platform != "win32")
File "/usr/lib/python2.7/subprocess.py", line 710, in __init

errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 8] Exec format error

Exception in thread Thread-32:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in *bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, _self.__kwargs)
File "/opt/htcap/core/lib/shell.py", line 44, in executor
self.process = subprocess.Popen(self.cmd,stderr=subprocess.PIPE, stdout=subprocess.PIPE, bufsize=0, close_fds=sys.platform != "win32")
File "/usr/lib/python2.7/subprocess.py", line 710, in __init

errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 8] Exec format error

Exception in thread Thread-33:

System is Backbox 32 bit

Proxy not used when crawling on localhost network

When launching a crawl, it seems that only the start url and robots.txt are requested through the proxy (during the validation process).

way to reproduce:

  • start a crawl with:
    $ ./htcap.py crawl -v -p http:127.0.0.1:8080 http://localhost/index.html test.db
    you get:
Initializing . . done
Database test.db initialized, crawl started with 10 threads
crawl result for: link GET http://localhost/index.html  
  new request found link GET http://localhost/test1.html 
crawl result for: link GET http://localhost/test1.html  
  new request found link GET http://localhost/test2.html 
  new request found link GET http://localhost/index.html 
crawl result for: link GET http://localhost/test2.html  
  new request found link GET http://localhost/test1.html 
  new request found link GET http://localhost/index.html 

Crawl finished, 3 pages analyzed in 0 minutes
  • I only got 2 hits in the proxy log:
    • http://…/index.html
    • http://…/robots.txt

Unable to crawl our webapp..

Hi,
Below is the stack trace ::
root@blr-1st-1-dhcp622:~/Dump/htcap# python htcap.py crawl https://i-cant-tell-you-this.com test.db
Initializing . .
Traceback (most recent call last):
File "htcap.py", line 49, in <module>
Crawler(sys.argv[2:])
File "/root/Dump/htcap/core/crawl/crawler.py", line 82, in __init__
self.main(argv)
File "/root/Dump/htcap/core/crawl/crawler.py", line 557, in main
start_requests = self.init_crawl(start_req, initial_checks, get_robots_txt)
File "/root/Dump/htcap/core/crawl/crawler.py", line 359, in init_crawl
rrequests = self.get_requests_from_robots(start_req)
File "/root/Dump/htcap/core/crawl/crawler.py", line 203, in get_requests_from_robots
lines = httpget.get_file().split("\n")
File "/root/Dump/htcap/core/lib/http_get.py", line 209, in get_file
res = opener.open(req, None, self.timeout)
File "/usr/lib/python2.7/urllib2.py", line 435, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 548, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 467, in error
result = self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 654, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1201, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1121, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 438, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 394, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 480, in readline
data = self._sock.recv(self._rbufsize)
socket.error: [Errno 104] Connection reset by peer

curl and Python requests are able to GET the page, though.

Extra error when crawling

I was trying to crawl a website with -m active -v and I am getting these errors. Could you please look into it?
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 62, in run
self.crawl()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 215, in crawl
probe = self.send_probe(request, errors)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
probeArray = self.load_probe_json(jsn)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
return json.loads(jsn)
File "/usr/lib/python2.7/json/init.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 5 column 1 - line 5 column 249 (char 69 - 317)

Exception in thread Thread-5:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 62, in run
self.crawl()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 215, in crawl
probe = self.send_probe(request, errors)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
probeArray = self.load_probe_json(jsn)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
return json.loads(jsn)
File "/usr/lib/python2.7/json/init.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 5 column 1 - line 5 column 249 (char 341 - 589)

[Question] Working with login forms

Hi Htcap team,

Thanks for the wonderful work (really, your tool is awesome).
I'd just like to ask a quick question:

Do you know how we can work with applications that require a user login?
I know precisely which page the login page is, and I have some credentials to test. But I don't know where in the tool I can tell it to use those credentials for the login. By default, the tool uses default credentials, which don't work on my application.

As I believe many other people are also asking this question, it might also be worth adding it to a wiki or documentation page.

Thanks a lot
