fcavallarin / htcap
htcap is a web application scanner able to crawl single page applications (SPA) recursively by intercepting ajax calls and DOM changes.
License: GNU General Public License v2.0
Hello. Thanks for the wonderful tool.
While reading the documentation I found the following:
The aggressive mode makes htcap also fill input values and post forms. This simulates a user that performs as many actions as possible on the page.
But when I test it against a login form, it does not fill the input values,
like this:
python htcap.py crawl -v http://1.1.1.1:8080/ target.db
Initializing . . . done
Database target-5.db initialized, crawl started with 10 threads (^C to pause or change verbosity)
crawl result for: link GET http://1.1.1.1:8080/
new request found form POST http://1.1.1.1:8080/login name=&pwd=&code=
crawl result for: form POST http://1.1.1.1:8080/login name=&pwd=&code=
What I think it should look like is: name=aaa&pwd=aaa&code=111
Is anything wrong?
Hi htcap authors.
I have recently discovered your tool and started using it for pen testing a SPA based on Angular and Spring Boot. The security of the application is based on tokens, which must be provided on every HTTP request to the REST API as a header (e.g. Auth: Basic tokenvalue).
Unfortunately, I am not able to use the crawler correctly; it stops at the login page. I provided the required header using the -E parameter (e.g. python htcap.py crawl -E 'Auth=Basic tokenvalue' target dest) and it does not work (double-checked with Wireshark, which does not show the header being added to the requests).
I also tried using the credentials parameter (e.g. python htcap.py crawl -l -A 'user:pass' target dest), but it does not work either. When trying to log in to the application, the crawler uses random strings each time.
Is this a bug, or am I using your tool wrong? Could you please provide some information?
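For reference, a minimal sketch (assuming the option-parsing logic shown in the crawler source quoted further down this page) of how a -E value is turned into an extra header: the string is split on the first '=' into a name/value pair, so 'Auth=Basic tokenvalue' should produce an Auth header. The loop input below is just an example value.

import sys

# sketch of the -E parsing as it appears in the crawler source below
extra_headers = {}
for v in ["Auth=Basic tokenvalue"]:
    (hn, hv) = v.split("=", 1)
    extra_headers[hn] = hv
print extra_headers  # {'Auth': 'Basic tokenvalue'}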
I wanted to integrate htcap in one of my docker containers, and all the rest of my code is in Python 3.
Aside from my own usage, here are the advantages of migrating to Python 3:
It should be like this:
"""
HTCAP - beta 1
Author: [email protected]
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
"""
from __future__ import unicode_literals
import sys
import os
import datetime
import time
import getopt
import json
import re
from urlparse import urlsplit, urljoin
from urllib import unquote
import urllib2
import threading
import subprocess
from random import choice
import string
import ssl
import signal
from core.lib.exception import *
from core.lib.cookie import Cookie
from core.lib.database import Database
from lib.shared import *
from lib.crawl_result import *
from core.lib.request import Request
from core.lib.http_get import HttpGet
from core.lib.shell import CommandExecutor
from crawler_thread import CrawlerThread
#from core.lib.shingleprint import ShinglePrint
from core.lib.texthash import TextHash
from core.lib.request_pattern import RequestPattern
from core.lib.utils import *
from core.constants import *
from lib.utils import *
from core.lib.progressbar import Progressbar
class Crawler:
def __init__(self, argv):
self.base_dir = getrealdir(__file__) + os.sep
self.crawl_start_time = int(time.time())
self.crawl_end_time = None
self.page_hashes = []
self.request_patterns = []
self.db_file = ""
self.display_progress = True
self.verbose = False
self.defaults = {
"useragent": 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3582.0 Safari/537.36',
"num_threads": 10,
"max_redirects": 10,
"out_file_overwrite": False,
"proxy": None,
"http_auth": None,
"use_urllib_onerror": True,
"group_qs": False,
"process_timeout": 300,
"scope": CRAWLSCOPE_DOMAIN,
"mode": CRAWLMODE_AGGRESSIVE,
"max_depth": 100,
"max_post_depth": 10,
"override_timeout_functions": True,
'crawl_forms': True, # only if mode == CRAWLMODE_AGGRESSIVE
'deduplicate_pages': True,
'headless_chrome': True,
'extra_headers': False,
'login_sequence': None,
'simulate_real_events': True
}
self.main(argv)
def usage(self):
infos = get_program_infos()
print ("htcap crawler ver " + infos['version'] + "\n"
"usage: crawl [options] url outfile\n"
"hit ^C to pause the crawler or change verbosity\n"
"Options: \n"
" -h this help\n"
" -w overwrite output file\n"
" -q do not display progress informations\n"
" -v be verbose\n"
" -m MODE set crawl mode:\n"
" - "+CRAWLMODE_PASSIVE+": do not intract with the page\n"
" - "+CRAWLMODE_ACTIVE+": trigger events\n"
" - "+CRAWLMODE_AGGRESSIVE+": also fill input values and crawl forms (default)\n"
" -s SCOPE set crawl scope\n"
" - "+CRAWLSCOPE_DOMAIN+": limit crawling to current domain (default)\n"
" - "+CRAWLSCOPE_DIRECTORY+": limit crawling to current directory (and subdirecotries) \n"
" - "+CRAWLSCOPE_URL+": do not crawl, just analyze a single page\n"
" -D maximum crawl depth (default: " + str(Shared.options['max_depth']) + ")\n"
" -P maximum crawl depth for consecutive forms (default: " + str(Shared.options['max_post_depth']) + ")\n"
" -F even if in aggressive mode, do not crawl forms\n"
" -H save HTML generated by the page\n"
" -d DOMAINS comma separated list of allowed domains (ex *.target.com)\n"
" -c COOKIES cookies as json or name=value pairs separaded by semicolon\n"
" -C COOKIE_FILE path to file containing COOKIES \n"
" -r REFERER set initial referer\n"
" -x EXCLUDED comma separated list of urls to exclude (regex) - ie logout urls\n"
" -p PROXY proxy string protocol:host:port - protocol can be 'http' or 'socks5'\n"
" -n THREADS number of parallel threads (default: " + str(self.defaults['num_threads']) + ")\n"
" -A CREDENTIALS username and password used for HTTP authentication separated by a colon\n"
" -U USERAGENT set user agent\n"
" -t TIMEOUT maximum seconds spent to analyze a page (default " + str(self.defaults['process_timeout']) + ")\n"
" -S skip initial checks\n"
" -G group query_string parameters with the same name ('[]' ending excluded)\n"
" -N don't normalize URL path (keep ../../)\n"
" -R maximum number of redirects to follow (default " + str(self.defaults['max_redirects']) + ")\n"
" -I ignore robots.txt\n"
" -O dont't override timeout functions (setTimeout, setInterval)\n"
" -e disable hEuristic page deduplication\n"
" -l do not run chrome in headless mode\n"
" -E HEADER set extra http headers (ex -E foo=bar -E bar=foo)\n"
" -M don't simulate real mouse/keyboard events\n"
" -L SEQUENCE set login sequence\n"
)
def generate_filename(self, name, out_file_overwrite):
fname = generate_filename(name, None, out_file_overwrite)
if out_file_overwrite:
if os.path.exists(fname):
os.remove(fname)
return fname
def kill_threads(self, threads):
Shared.th_condition.acquire()
for th in threads:
if th.isAlive():
th.exit = True
th.pause = False
if th.probe_executor and th.probe_executor.cmd:
th.probe_executor.cmd.terminate()
Shared.th_condition.release()
# start notify() chain
Shared.th_condition.acquire()
Shared.th_condition.notifyAll()
Shared.th_condition.release()
def pause_threads(self, threads, pause):
Shared.th_condition.acquire()
for th in threads:
if th.isAlive():
th.pause = pause
Shared.th_condition.release()
def init_db(self, dbname, report_name):
infos = {
"target": Shared.starturl,
"scan_date": -1,
"urls_scanned": -1,
"scan_time": -1,
'command_line': " ".join(sys.argv)
}
database = Database(dbname, report_name, infos)
database.create()
return database
def check_startrequest(self, request):
h = HttpGet(request, Shared.options['process_timeout'], 2, Shared.options['useragent'], Shared.options['proxy'])
try:
h.get_requests()
except NotHtmlException:
print "\nError: Document is not html"
sys.exit(1)
except Exception as e:
print "\nError: unable to open url: %s" % e
sys.exit(1)
def get_requests_from_robots(self, request):
purl = urlsplit(request.url)
url = "%s://%s/robots.txt" % (purl.scheme, purl.netloc)
getreq = Request(REQTYPE_LINK, "GET", url, extra_headers=Shared.options['extra_headers'])
try:
# request, timeout, retries=None, useragent=None, proxy=None):
httpget = HttpGet(getreq, 10, 1, "Googlebot", Shared.options['proxy'])
lines = httpget.get_file().split("\n")
except urllib2.HTTPError:
return []
except:
return []
#raise
requests = []
for line in lines:
directive = ""
url = None
try:
directive, url = re.sub("\#.*","",line).split(":",1)
except:
continue # ignore errors
if re.match("(dis)?allow", directive.strip(), re.I):
req = Request(REQTYPE_LINK, "GET", url.strip(), parent=request)
requests.append(req)
return adjust_requests(requests) if requests else []
def randstr(self, length):
all_chars = string.digits + string.ascii_letters + string.punctuation
random_string = ''.join(choice(all_chars) for _ in range(length))
return random_string
def request_is_duplicated(self, page_hash):
for h in self.page_hashes:
if TextHash.compare(page_hash, h):
return True
return False
def main_loop(self, threads, start_requests, database):
pending = len(start_requests)
crawled = 0
pb = Progressbar(self.crawl_start_time, "pages processed")
req_to_crawl = start_requests
while True:
try:
if self.display_progress and not self.verbose:
tot = (crawled + pending)
pb.out(tot, crawled)
if pending == 0:
# is the check of running threads really needed?
running_threads = [t for t in threads if t.status == THSTAT_RUNNING]
if len(running_threads) == 0:
if self.display_progress or self.verbose:
print ""
break
if len(req_to_crawl) > 0:
Shared.th_condition.acquire()
Shared.requests.extend(req_to_crawl)
Shared.th_condition.notifyAll()
Shared.th_condition.release()
req_to_crawl = []
Shared.main_condition.acquire()
Shared.main_condition.wait(1)
if len(Shared.crawl_results) > 0:
database.connect()
database.begin()
for result in Shared.crawl_results:
crawled += 1
pending -= 1
if self.verbose:
print "crawl result for: %s " % result.request
if len(result.request.user_output) > 0:
print " user: %s" % json.dumps(result.request.user_output)
if result.errors:
print "* crawler errors: %s" % ", ".join(result.errors)
database.save_crawl_result(result, True)
if Shared.options['deduplicate_pages']:
if self.request_is_duplicated(result.page_hash):
filtered_requests = []
for r in result.found_requests:
if RequestPattern(r).pattern not in self.request_patterns:
filtered_requests.append(r)
result.found_requests = filtered_requests
if self.verbose:
print " * marked as duplicated ... requests filtered"
self.page_hashes.append(result.page_hash)
for r in result.found_requests:
self.request_patterns.append(RequestPattern(r).pattern)
for req in result.found_requests:
database.save_request(req)
if self.verbose and req not in Shared.requests and req not in req_to_crawl:
print " new request found %s" % req
if request_is_crawlable(req) and req not in Shared.requests and req not in req_to_crawl:
if request_depth(req) > Shared.options['max_depth'] or request_post_depth(req) > Shared.options['max_post_depth']:
if self.verbose:
print " * cannot crawl: %s : crawl depth limit reached" % req
result = CrawlResult(req, errors=[ERROR_CRAWLDEPTH])
database.save_crawl_result(result, False)
continue
if req.redirects > Shared.options['max_redirects']:
if self.verbose:
print " * cannot crawl: %s : too many redirects" % req
result = CrawlResult(req, errors=[ERROR_MAXREDIRECTS])
database.save_crawl_result(result, False)
continue
pending += 1
req_to_crawl.append(req)
Shared.crawl_results = []
database.commit()
database.close()
Shared.main_condition.release()
except KeyboardInterrupt:
try:
Shared.main_condition.release()
Shared.th_condition.release()
except:
pass
self.pause_threads(threads, True)
if not self.get_runtime_command():
print "Exiting . . ."
return
print "Crawler is running"
self.pause_threads(threads, False)
def get_runtime_command(self):
while True:
print (
"\nCrawler is paused.\n"
" r resume\n"
" v verbose mode\n"
" p show progress bar\n"
" q quiet mode\n"
"Hit ctrl-c again to exit\n"
)
try:
ui = raw_input("> ").strip()
except KeyboardInterrupt:
print ""
return False
if ui == "r":
break
elif ui == "v":
self.verbose = True
break
elif ui == "p":
self.display_progress = True
self.verbose = False
break
elif ui == "q":
self.verbose = False
self.display_progress = False
break
print " "
return True
def init_crawl(self, start_req, check_starturl, get_robots_txt):
start_requests = [start_req]
try:
if check_starturl:
self.check_startrequest(start_req)
stdoutw(". ")
if get_robots_txt:
rrequests = self.get_requests_from_robots(start_req)
stdoutw(". ")
for req in rrequests:
if request_is_crawlable(req) and not req in start_requests:
start_requests.append(req)
except KeyboardInterrupt:
print "\nAborted"
sys.exit(0)
return start_requests
def main(self, argv):
Shared.options = self.defaults
Shared.th_condition = threading.Condition()
Shared.main_condition = threading.Condition()
deps_errors = check_dependences(self.base_dir)
if len(deps_errors) > 0:
print "Dependences errors: "
for err in deps_errors:
print " %s" % err
sys.exit(1)
start_cookies = []
start_referer = None
probe_options = ["-R", self.randstr(20)]
threads = []
num_threads = self.defaults['num_threads']
out_file = ""
out_file_overwrite = self.defaults['out_file_overwrite']
cookie_string = None
initial_checks = True
http_auth = None
get_robots_txt = True
save_html = False
try:
opts, args = getopt.getopt(argv[2:], 'hc:t:jn:x:A:p:d:BGR:U:wD:s:m:C:qr:SIHFP:OvelE:L:M')
except getopt.GetoptError as err:
print str(err)
sys.exit(1)
if len(argv) < 2:
self.usage()
sys.exit(1)
for o, v in opts:
if o == '-h':
self.usage()
sys.exit(0)
elif o == '-c':
cookie_string = v
elif o == '-C':
try:
with open(v) as cf:
cookie_string = cf.read()
except Exception as e:
print "error reading cookie file"
sys.exit(1)
elif o == '-r':
start_referer = v
elif o == '-n':
num_threads = int(v)
elif o == '-t':
Shared.options['process_timeout'] = int(v)
elif o == '-q':
self.display_progress = False
elif o == '-A':
http_auth = v
elif o == '-p':
try:
Shared.options['proxy'] = parse_proxy_string(v)
except Exception as e:
print e
sys.exit(1)
elif o == '-d':
for ad in v.split(","):
# convert *.domain.com to *.\.domain\.com
pattern = re.escape(ad).replace("\\*\\.","((.*\\.)|)")
Shared.allowed_domains.add(pattern)
elif o == '-x':
for eu in v.split(","):
try:
re.match(eu, "")
except:
print "* ERROR: regex failed: %s" % eu
sys.exit(1)
Shared.excluded_urls.add(eu)
elif o == "-G":
Shared.options['group_qs'] = True
elif o == "-w":
out_file_overwrite = True
elif o == "-R":
Shared.options['max_redirects'] = int(v)
elif o == "-U":
Shared.options['useragent'] = v
elif o == "-s":
if not v in (CRAWLSCOPE_DOMAIN, CRAWLSCOPE_DIRECTORY, CRAWLSCOPE_URL):
self.usage()
print "* ERROR: wrong scope set '%s'" % v
sys.exit(1)
Shared.options['scope'] = v
elif o == "-m":
if not v in (CRAWLMODE_PASSIVE, CRAWLMODE_ACTIVE, CRAWLMODE_AGGRESSIVE):
self.usage()
print "* ERROR: wrong mode set '%s'" % v
sys.exit(1)
Shared.options['mode'] = v
elif o == "-S":
initial_checks = False
elif o == "-I":
get_robots_txt = False
elif o == "-H":
save_html = True
elif o == "-D":
Shared.options['max_depth'] = int(v)
elif o == "-P":
Shared.options['max_post_depth'] = int(v)
elif o == "-O":
Shared.options['override_timeout_functions'] = False
elif o == "-F":
Shared.options['crawl_forms'] = False
elif o == "-v":
self.verbose = True
elif o == "-e":
Shared.options['deduplicate_pages'] = False
elif o == "-l":
Shared.options['headless_chrome'] = False
elif o == "-M":
Shared.options['simulate_real_events'] = False
elif o == "-E":
if not Shared.options['extra_headers']:
Shared.options['extra_headers'] = {}
(hn, hv) = v.split("=", 1)
Shared.options['extra_headers'][hn] = hv
elif o == "-L":
try:
with open(v) as cf:
Shared.options['login_sequence'] = json.loads(cf.read())
Shared.options['login_sequence']["__file__"] = os.path.abspath(v)
except ValueError as e:
print "* ERROR: decoding login sequence"
sys.exit(1)
except Exception as e:
print "* ERROR: login sequence file not found"
sys.exit(1)
probe_cmd = get_node_cmd()
if not probe_cmd: # maybe useless
print "Error: unable to find node executable"
sys.exit(1)
if Shared.options['scope'] != CRAWLSCOPE_DOMAIN and len(Shared.allowed_domains) > 0:
print "* Warinig: option -d is valid only if scope is %s" % CRAWLSCOPE_DOMAIN
if cookie_string:
try:
start_cookies = parse_cookie_string(cookie_string)
except Exception as e:
print "error decoding cookie string"
sys.exit(1)
if Shared.options['mode'] != CRAWLMODE_AGGRESSIVE:
probe_options.append("-f") # dont fill values
if Shared.options['mode'] == CRAWLMODE_PASSIVE:
probe_options.append("-t") # dont trigger events
if Shared.options['proxy']:
probe_options.extend(["-y", "%s:%s:%s" % (Shared.options['proxy']['proto'], Shared.options['proxy']['host'], Shared.options['proxy']['port'])])
if not Shared.options['headless_chrome']:
probe_options.append("-l")
probe_cmd.append(os.path.join(self.base_dir, 'probe', 'analyze.js'))
if len(Shared.excluded_urls) > 0:
probe_options.extend(("-X", ",".join(Shared.excluded_urls)))
if save_html:
probe_options.append("-H")
probe_options.extend(("-x", str(Shared.options['process_timeout'])))
probe_options.extend(("-A", Shared.options['useragent']))
if not Shared.options['override_timeout_functions']:
probe_options.append("-O")
if Shared.options['extra_headers']:
probe_options.extend(["-E", json.dumps(Shared.options['extra_headers'])])
if not Shared.options['simulate_real_events']:
probe_options.append("-M")
Shared.probe_cmd = probe_cmd + probe_options
Shared.starturl = normalize_url(argv[0])
out_file = argv[1]
purl = urlsplit(Shared.starturl)
Shared.allowed_domains.add(purl.hostname)
if Shared.options['login_sequence'] and Shared.options['login_sequence']['type'] == LOGSEQTYPE_SHARED:
login_req = Request(REQTYPE_LINK, "GET", Shared.options['login_sequence']['url'],
set_cookie=Shared.start_cookies,
http_auth=http_auth,
referer=start_referer,
extra_headers=Shared.options['extra_headers']
)
stdoutw("Logging in . . . ")
try:
pe = ProbeExecutor(login_req, Shared.probe_cmd + ["-z"], login_sequence=Shared.options['login_sequence'])
probe = pe.execute()
if not probe:
print "\n* ERROR: login sequence failed to execute probe"
sys.exit(1)
if probe.status == "ok":
for c in probe.cookies:
if not Shared.options['login_sequence']['cookies'] or c.name in Shared.options['login_sequence']['cookies']:
Shared.start_cookies.append(c)
else:
print "\n* ERROR: login sequence failed:\n %s" % probe.errmessage
sys.exit(1)
except KeyboardInterrupt:
pe.terminate()
print "\nAborted"
sys.exit(0)
print "done"
for sc in start_cookies:
Shared.start_cookies.append(Cookie(sc, Shared.starturl))
start_req = Request(REQTYPE_LINK, "GET", Shared.starturl,
set_cookie=Shared.start_cookies,
http_auth=http_auth,
referer=start_referer,
extra_headers=Shared.options['extra_headers']
)
if not hasattr(ssl, "SSLContext"):
print "* WARNING: SSLContext is not supported with this version of python, consider to upgrade to >= 2.7.9 in case of SSL errors"
stdoutw("Initializing . ")
start_requests = self.init_crawl(start_req, initial_checks, get_robots_txt)
database = None
self.db_file = self.generate_filename(out_file, out_file_overwrite)
try:
database = self.init_db(self.db_file, out_file)
except Exception as e:
print str(e)
sys.exit(1)
database.save_crawl_info(
htcap_version = get_program_infos()['version'],
target = Shared.starturl,
start_date = self.crawl_start_time,
commandline = cmd_to_str(argv),
user_agent = Shared.options['useragent'],
proxy = json.dumps(Shared.options['proxy']),
extra_headers = json.dumps(Shared.options['extra_headers']),
cookies = json.dumps([x.get_dict() for x in Shared.start_cookies])
)
database.connect()
database.begin()
for req in start_requests:
database.save_request(req)
database.commit()
database.close()
print "done"
print "Database %s initialized, crawl started with %d threads (^C to pause or change verbosity)" % (self.db_file, num_threads)
for n in range(0, num_threads):
thread = CrawlerThread()
threads.append(thread)
thread.start()
self.main_loop(threads, start_requests, database)
self.kill_threads(threads)
self.crawl_end_time = int(time.time())
print "Crawl finished, %d pages analyzed in %d minutes" % (Shared.requests_index, (self.crawl_end_time - self.crawl_start_time) / 60)
database.save_crawl_info(end_date=self.crawl_end_time)
Sites relying on window.location for navigation are not crawled properly.
A crawl of a page like this one will not see the whole site:
<html>
<head>
<script>
function go1(){
window.location = "../1.php";
}
function go2(){
window.location = "../2.php";
}
function go3(){
window.location = "../3.php";
}
</script>
</head>
<body>
<span onclick="go1()">click here </span>
<a onclick="go2()">click here </a>
<td onclick="go3()">click here </td>
</body>
</html>
When I try to
python htcap/htcap.py crawl http://testesseg.c3sl.ufpr.br:3000/ badstore.bd \; scan wapiti \; util report badstore.db badstore.html
The output gives me an error:
Scan finished
Traceback (most recent call last):
File "htcap/htcap.py", line 83, in <module>
dbfile = cr.db_filie if cr else None
AttributeError: Crawler instance has no attribute 'db_filie'
It is just a small confusion about the variable names in the Crawler class.
htcap.py, line 83 in 19d3e2a
Crawler class: line 61 in 19d3e2a
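A minimal sketch of the fix, assuming line 83 of htcap.py is the line shown in the traceback: the attribute set in Crawler.__init__ is db_file, so the reference just has to match it.

# hypothetical corrected form of htcap.py line 83: use the attribute name
# actually defined in Crawler.__init__ (self.db_file), not "db_filie"
dbfile = cr.db_file if cr else None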
Hi and thanks for the awesome tool you're working on here.
I ran a crawl with htcap and initiated a scan with the collected requests. It worked really well otherwise, but POST requests with the same parameter values were scanned multiple times. Looking at the code, there only seems to be deduplication for GET requests, whereas all POST requests are included in the crawl results even if they match previously found POSTs.
You could try to use the same method you currently have for GET deduplication, i.e. collect parameters, null out their values and sort them. Different body formats may need to be parsed here (at least form, JSON and XML) which requires some additional work.
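A minimal sketch of the idea (not htcap's actual code, and the function name is made up): reduce a POST body to a sorted list of parameter names with their values nulled out, so that two POSTs to the same URL with the same parameter names compare equal. XML bodies would need similar handling.

import json
from urlparse import parse_qsl  # urllib.parse on Python 3

def post_body_pattern(body, content_type=""):
    # JSON bodies: keep only the sorted top-level keys
    if "json" in content_type:
        try:
            data = json.loads(body)
            keys = sorted(data.keys()) if isinstance(data, dict) else []
            return "json:" + ",".join(keys)
        except ValueError:
            pass
    # default: treat the body as a url-encoded form and null out the values
    names = sorted(n for n, _ in parse_qsl(body, keep_blank_values=True))
    return "form:" + "&".join("%s=" % n for n in names)

# both reduce to "form:code=&name=&pwd=", so the second POST could be skipped
print post_body_pattern("name=aaa&pwd=bbb&code=1") == post_body_pattern("name=x&pwd=y&code=2")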
Again, amazing work on the crawler so far!
Python 2.x will be EOL by the end of this year. Any plans to move this project to 3.x?
<meta http-equiv=refresh content="1;url=ccnt/sczr/login" >
I was trying to put together a sequence that involves a clickToNavigate action, but the button I am trying to push does not have an ID that I could pick. I do not know if there is any way around it given its specification. If there is, I would be thankful if you kindly shared what that is; otherwise, would you consider adding e.g. XPath support to define elements?
Thank you
Issuing the following command (either installed locally or in a docker container):
The good ✔️:
# htcap crawl example.com target.db
Initializing . . . done
Database target-2.db initialized, crawl started with 10 threads (^C to pause or change verbosity)
[=================================] 1 of 1 pages processed in 0 minutes
Crawl finished, 1 pages analyzed in 0 minutes
The bad ❌ :
# htcap crawl https://elm-spa-example.netlify.app/ target.db
Initializing . . . done
Database target-3.db initialized, crawl started with 10 threads (^C to pause or change verbosity)
[ ] 0 of 1 pages processed in 10 minutes
The expected result would be for it to actually crawl the site, but nothing happens.
When opening chromium in non-headless mode, one can see that no pages are opened; htcap stays on the about:local page.
Am I missing something, or is it a bug?
python3 htcap.py
Traceback (most recent call last):
File "/opt/tools/htcap/htcap.py", line 22, in
from core.crawl.crawler import Crawler
File "/opt/tools/htcap/core/crawl/crawler.py", line 39, in
from core.lib.http_get import HttpGet
File "/opt/tools/htcap/core/lib/http_get.py", line 29, in
import core.lib.thirdparty.pysocks.socks as socks
File "/opt/tools/htcap/core/lib/thirdparty/pysocks/socks.py", line 62, in
from collections import Callable
ImportError: cannot import name 'Callable' from 'collections' (/usr/lib/python3.10/collections
python3 -V
Python 3.10.4
As the issue title says, we cannot get the link click_link.php?id=2 on the website below.
http://demo.aisec.cn/demo/aisec/
I am a bug bounty hunter and I tested with htcap, but I keep getting a recurrent error. I am using an up-to-date PhantomJS:
~$ phantomjs --version
2.1.1
Please describe the solution.
/opt/htcap/htcap$ sudo python ./htcap.py crawl -m aggressive -H https://www.xxxxxx.fr/fr/index.html cicfr.db
Initializing . . . done
Database cicfr.db initialized, crawl started with 10 threads
[                                 ] 0 of 31 pages processed in 0 minutes
Exception in thread Thread-9:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/opt/htcap/htcap/core/crawl/crawler_thread.py", line 62, in run
self.crawl()
File "/opt/htcap/htcap/core/crawl/crawler_thread.py", line 215, in crawl
probe = self.send_probe(request, errors)
File "/opt/htcap/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
probeArray = self.load_probe_json(jsn)
File "/opt/htcap/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
return json.loads(jsn)
File "/usr/lib/python2.7/json/init.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Exception in thread Thread-2:
Like the existing -s domain option, but with a limit to the exact host.
For example, if the given url is "http://www.example.com", crawling would be limited to the "www" subdomain and would not consider "blog.example.com" in scope.
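A minimal sketch of the difference, based on how the crawler builds its allowed-domain regexes in the source quoted above; the host-only pattern is just an illustration of the requested behaviour, not an existing option.

import re

start_host = "www.example.com"

# current -d *.example.com behaviour: the wildcard matches any subdomain
wildcard_pattern = re.escape("*.example.com").replace("\\*\\.", "((.*\\.)|)")
print re.match(wildcard_pattern, "blog.example.com") is not None  # True

# a host-limited scope would anchor on the exact start host instead
host_pattern = "^" + re.escape(start_host) + "$"
print re.match(host_pattern, "www.example.com") is not None   # True
print re.match(host_pattern, "blog.example.com") is not None  # False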
Hello.
What about sending arbitrary arguments to the chrome browser?
First of all I want to ignore any SSL errors:
chrome --ignore-certificate-errors --ignore-ssl-errors ...
Also, very often we are faced with old/hidden sites (without valid dns):
chrome --host-resolver-rules="MAP old.company.int 1.2.3.4"
Hello,
Htcap is not submitting forms correctly while crawling SPAs.
For example, when I crawled the website https://brokencrystals.com with htcap, it didn't send the requests properly while crawling.
The actual login request looks like the one below, where the form is submitted to the /api/auth/login endpoint with a POST request and a json body.
On the other hand, htcap sent a GET request with the data in the URL to the /userlogin endpoint (which is a frontend page that does not handle any backend operations).
I have seen this same behavior multiple times while crawling other SPAs also.
The actual behavior is:
-w, overwrite the existing one.
Hi, your answers solved my problems in the older versions, but with the new one I'm having a problem with sqlmap: I set the path to the directory where I have sqlmap, but it is not working. Can you please help me? Here is what I'm doing:
def get_settings(self):
    return dict(
        request_types = "xhr,link,form,jsonp,redirect,fetch",
        num_threads = 5,
        process_timeout = 300,
        scanner_exe = "/hacktools/webtools/htcap/sqlmap/sqlmap.py"
    )
But it is not working; this is all I get:
Sqlmap executable not found in $PATH
Hello there and thanks for your work! This error is exotic and sporadic and seems to come down to PhantomJS, but I think it will be useful to leave a report for it, as I've found no similar reports for PhantomJS alone.
It's Windows 10.
On the (usually second) htcap.py crawl command, a "There is a problem with your tablet driver" Windows error dialog pops up from the PhantomJS process.
I've done some simple debugging of htcap, and the bug sits where PhantomJS works on the commands received from htcap.py crawl. Also, any subsequent invocation of PhantomJS (including phantomjs.exe --version) brings up the same tablet driver error. Until the tablet drivers are killed, both htcap crawl and PhantomJS hang and don't work properly.
This tablet driver is something that drives Wacom tablets. Probably PhantomJS tries to interact with that input while being used by htcap, and that brings on the error. I'll try to debug things by myself and get back with a more definite report.
Hello,
When you try to crawl this site, "http://testaspnet.vulnweb.com/login.aspx" you see that there is a simple login form. After examining the crawl result I realized that some form parameter was missing.
The command:
python2 tools/htcap/htcap.py crawl -s url http://testaspnet.vulnweb.com/login.aspx test.db
The Htcap crawler result:
__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTIyMzk2OTgxMQ9kFgICAQ9kFgICAQ9kFgQCAQ8WBB4EaHJlZgUKbG9naW4uYXNweB4JaW5uZXJodG1sBQVsb2dpbmQCAw8WBB8AZB4HVmlzaWJsZWhkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBQ9jYlBlcnNpc3RDb29raWXvu6wIgkhiLuMALZWDxQWHeEyxzQ%3D%3D&__VIEWSTATEGENERATOR=C2EE9ABB&__EVENTVALIDATION=%2FwEWWwKJvpbuCALStq24BwK3jsrkBALtuvfLDQKC3IeGDAK9zvyMDAKJ1YalCAKF%2Bb%2FgBQKF%2Bb%2FgBQKF%2BdOEDQKF%2BdOEDQKF%2BYfsCwKF%2BYfsCwKF%2BbuDAwKF%2BbuDAwKuk%2BfECQKuk%2BfECQKuk5ubAQKuk5ubAQKuk4%2B%2BCgKuk4%2B%2BCgKuk6PVAwKuk6PVAwKuk9fpDAKuk9fpDAKuk8uMBAKuk8uMBAKuk%2F%2BjDQKuk%2F%2BjDQKuk5PGBgKuk5PGBgKuk8evAwKuk8evAwKuk%2FvCDAKuk%2FvCDAKDusXrDwKDusXrDwKDuvmOBwKDuvmOBwKDuu0lAoO67SUCg7qB%2BAkCg7qB%2BAkCg7q1nwECg7q1nwECg7qpsgoCg7qpsgoCg7rd1gMCg7rd1gMCg7rx7QwCg7rx7QwCg7ql1QkCg7ql1QkCg7rZ6QICg7rZ6QICxtmYngcCxtmYngcCxtmMNQLG2Yw1AsbZoMgJAsbZoMgJAsbZ1OwCAsbZ1OwCAsbZyIMKAsbZyIMKAsbZ%2FKYDAsbZ%2FKYDAsbZkP0MAsbZkP0MAsbZhJAEAsbZhJAEAsbZ%2BPkCAsbZ%2BPkCAsbZ7JwKAsbZ7JwKAtvg%2FoUNAtvg%2FoUNAtvgktgGAtvgktgGAtvghv8PAtvghv8PAtvgupIHAtvgupIHAtvgrikC2%2BCuKQLb4MLNCQLb4MLNCQLb4PbgAgLb4PbgAgLb4OqHCgLb4OqHCkIzkaRk2Lc5p1%2BA0FodgqNefSMy&tbUsername=UTgXosHq&tbPassword=tGS.VU634.!&cbPersistCookie=on
The normal crawling result:
__EVENTARGUMENT=1&__EVENTTARGET=1&__EVENTVALIDATION=/wEWWwKJvpbuCALStq24BwK3jsrkBALtuvfLDQKC3IeGDAK9zvyMDAKJ1YalCAKF%2Bb/gBQKF%2Bb/gBQKF%2BdOEDQKF%2BdOEDQKF%2BYfsCwKF%2BYfsCwKF%2BbuDAwKF%2BbuDAwKuk%2BfECQKuk%2BfECQKuk5ubAQKuk5ubAQKuk4%2B%2BCgKuk4%2B%2BCgKuk6PVAwKuk6PVAwKuk9fpDAKuk9fpDAKuk8uMBAKuk8uMBAKuk/%2BjDQKuk/%2BjDQKuk5PGBgKuk5PGBgKuk8evAwKuk8evAwKuk/vCDAKuk/vCDAKDusXrDwKDusXrDwKDuvmOBwKDuvmOBwKDuu0lAoO67SUCg7qB%2BAkCg7qB%2BAkCg7q1nwECg7q1nwECg7qpsgoCg7qpsgoCg7rd1gMCg7rd1gMCg7rx7QwCg7rx7QwCg7ql1QkCg7ql1QkCg7rZ6QICg7rZ6QICxtmYngcCxtmYngcCxtmMNQLG2Yw1AsbZoMgJAsbZoMgJAsbZ1OwCAsbZ1OwCAsbZyIMKAsbZyIMKAsbZ/KYDAsbZ/KYDAsbZkP0MAsbZkP0MAsbZhJAEAsbZhJAEAsbZ%2BPkCAsbZ%2BPkCAsbZ7JwKAsbZ7JwKAtvg/oUNAtvg/oUNAtvgktgGAtvgktgGAtvghv8PAtvghv8PAtvgupIHAtvgupIHAtvgrikC2%2BCuKQLb4MLNCQLb4MLNCQLb4PbgAgLb4PbgAgLb4OqHCgLb4OqHCkIzkaRk2Lc5p1%2BA0FodgqNefSMy&__VIEWSTATE=/wEPDwUKLTIyMzk2OTgxMQ9kFgICAQ9kFgICAQ9kFgQCAQ8WBB4EaHJlZgUKbG9naW4uYXNweB4JaW5uZXJodG1sBQVsb2dpbmQCAw8WBB8AZB4HVmlzaWJsZWhkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBQ9jYlBlcnNpc3RDb29raWXvu6wIgkhiLuMALZWDxQWHeEyxzQ==&__VIEWSTATEGENERATOR=C2EE9ABB&btnLogin=Login&cbPersistCookie=on&tbPassword=test&tbUsername=test
As you can see, the "btnLogin" parameter is missing in the Htcap crawl result, and the missing parameter causes a problem.
Is there any workaround, or am I missing a command line parameter?
Greetings,
When crawling a page with a <base href="…"> set in the header, the crawler returns relative paths resolved against the current path and not against the one provided in the <base> tag.
To reproduce
Crawl a page with this html:
<!DOCTYPE html>
<html>
<head>
<base href="http://somewhere.else/someWeirdPath/" target="_self">
</head>
<body>
<a href="1.html">page 1</a>
</body>
</html>
Result
Htcap found http://mycurrent.domain/1.html but should have found http://somewhere.else/someWeirdPath/1.html.
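A minimal sketch of the expected resolution using urljoin (which the crawler already imports); the page_url value here is hypothetical, the other values come from the example above.

from urlparse import urljoin  # urllib.parse.urljoin on Python 3

page_url = "http://mycurrent.domain/somePage.html"         # hypothetical current page
base_href = "http://somewhere.else/someWeirdPath/"          # from the <base> tag, when present
link_href = "1.html"

# relative links should be resolved against the <base> href when one is declared,
# falling back to the page url otherwise
print urljoin(base_href or page_url, link_href)  # http://somewhere.else/someWeirdPath/1.html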
Only the urls with errors are displayed in the html report; no other urls are visible. I tried using https://htcap.org/scanme/ but got the same output.
Errors: 3
probe_killed
probe_failure
HTTP Error 400: Bad Request
Command:
./htcap.py crawl https://htcap.org htcap.db -v
Initializing . . . done
Database htcap-2.db initialized, crawl started with 10 threads (^C to pause or change verbosity)
[================== ] 5 of 9 pages processed in 0 minutes^C
Crawler is paused.
r resume
v verbose mode
p show progress bar
q quiet mode
Hit ctrl-c again to exit
v
Crawler is running
crawl result for: redirect GET https://htcap.org/scanme/ng/
PhantomJS is no longer under development, so we need to move to headless Chrome.
I'm trying to use htcap with BurpSuite. However that always fails. Any hints?
root@kali:~/offsec/htcap# python htcap.py crawl -p http://localhost:8080 heise.de test.db
Initializing .
Error: unable to open url: <urlopen error no host given>
root@kali:~/offsec/htcap#
Without the -p flag it works.
Hello!
First, amazing script! Lately I have become more familiar with it and now use it in almost every security assessment. It gives me a very clear insight into the web application and makes the job easier. I have some ideas and features that could maybe be added; I'll let you know when they are ready!
The second thing I noticed is that on a large scale crawl I'm getting this error:
Database karta.db initialized, crawl started with 10 threads
[====                             ] 966 of 7928 pages processed in 67 minutes
Exception in thread Thread-9:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/share/htcap/core/crawl/crawler_thread.py", line 62, in run
self.crawl()
File "/usr/share/htcap/core/crawl/crawler_thread.py", line 212, in crawl
probe = self.send_probe(request, errors)
File "/usr/share/htcap/core/crawl/crawler_thread.py", line 161, in send_probe
probeArray = self.load_probe_json(jsn)
File "/usr/share/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
return json.loads(jsn)
File "/usr/lib/python2.7/json/init.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 4 column 1 - line 4 column 249 (char 58 - 306)
Maybe it's just my system! I'm using Kali Linux, but it's highly modified and built up for me, so maybe that's why.
It seems there is no impact on the workflow; it continues crawling, and afterwards, when working on the database, there is no error!
Anyways good script and good luck developing it!
Exception in thread Thread-1:
Traceback (most recent call last):
File "E:\Python2\lib\threading.py", line 810, in _bootstrap_inner
self.run()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 64, in run
self.crawl()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 147, in crawl
probe = self.send_probe(request, errors)
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 105, in send_probe
process_timeout=Shared.options['process_timeout']
File "E:\exploit\spider\htcap\core\crawl\lib\utils.py", line 210, in execute
probeArray = self.load_probe_json(jsn)
File "E:\exploit\spider\htcap\core\crawl\lib\utils.py", line 126, in load_probe_json
return json.loads(jsn)
File "E:\Python2\lib\json_init.py", line 338, in loads
return _default_decoder.decode(s)
File "E:\Python2\lib\json\decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 55 column 1 - line 56 column 338 (char 15049 - 17940)
Exception in thread Thread-3:
Traceback (most recent call last):
File "E:\Python2\lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 64, in run
self.crawl()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 147, in crawl
probe = self.send_probe(request, errors)
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 105, in send_probe
process_timeout=Shared.options['process_timeout']
File "E:\exploit\spider\htcap\core\crawl\lib\utils.py", line 198, in execute
os.unlink(self.out_file)
WindowsError: [Error 32] : u'c:\users\admin\appdata\local\temp\htcap_output-4b54a465-e420-4c58-88b9-702e80ba8f20.json'
Traceback (most recent call last):
File "E:\Python2\lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 64, in run
self.crawl()
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 147, in crawl
probe = self.send_probe(request, errors)
File "E:\exploit\spider\htcap\core\crawl\crawler_thread.py", line 105, in send_probe
process_timeout=Shared.options['process_timeout']
File "E:\exploit\spider\htcap\core\crawl\lib\utils.py", line 210, in execute
probeArray = self.load_probe_json(jsn)
File "E:\exploit\spider\htcap\core\crawl\lib\utils.py", line 128, in load_probe_json
print "-- JSON DECODE ERROR %s" % jsn
IOError: [Errno 0] Error
During JS crawling, page scripts containing setTimeout and setInterval do not receive the provided arguments.
The signature of the function should be (since the delay is removed):
var intervalID = scope.setInterval(func[, param1, param2, ...]);
but actually is:
var intervalID = scope.setInterval(func);
Some APIs are protected by a Bearer token or some other form of http header based auth
Command: python htcap.py crawl -p http:127.0.0.1:8080 -m aggressive -c "security_level=0; PHPSESSID=1ln04buglpc7ljdt95nu0r4a75" -x '.logout.' http://192.168.88.136/bWAPP/bWAPP/sqli_6.php baidu3.db
-x can't exclude the (regex) logout urls,
and the script can't auto-click buttons to post requests.
The short video on https://htcap.org is so cool! But I don't know how to repeat it.
In /core/scan/scanner/arachni.py (and the same for sqlmap.py) I changed the scanner exe from /usr/share/arachni/bin/arachni to /home/arachni/bin/arachni, because that's where I have the arachni framework, but it doesn't work.
It always worked for me in the previous versions (https://github.com/Vulnerability-scanner/htcap), but in this version I only get "arachni executable not found". Can you please help?
I am getting this error when running the crawl command:
Exception in thread Thread-31:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in *bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/opt/htcap/core/lib/shell.py", line 44, in executor
self.process = subprocess.Popen(self.cmd,stderr=subprocess.PIPE, stdout=subprocess.PIPE, bufsize=0, close_fds=sys.platform != "win32")
File "/usr/lib/python2.7/subprocess.py", line 710, in __init
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 8] Exec format error
Exception in thread Thread-30:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in *bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, _self.__kwargs)
File "/opt/htcap/core/lib/shell.py", line 44, in executor
self.process = subprocess.Popen(self.cmd,stderr=subprocess.PIPE, stdout=subprocess.PIPE, bufsize=0, close_fds=sys.platform != "win32")
File "/usr/lib/python2.7/subprocess.py", line 710, in __init
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 8] Exec format error
Exception in thread Thread-32:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in *bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, _self.__kwargs)
File "/opt/htcap/core/lib/shell.py", line 44, in executor
self.process = subprocess.Popen(self.cmd,stderr=subprocess.PIPE, stdout=subprocess.PIPE, bufsize=0, close_fds=sys.platform != "win32")
File "/usr/lib/python2.7/subprocess.py", line 710, in __init
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 8] Exec format error
Exception in thread Thread-33:
System is Backbox 32 bit
When launching a crawl, it seems that only the start url and robots.txt are requested through the proxy (during the validation process).
$ ./htcap.py crawl -v -p http:127.0.0.1:8080 http://localhost/index.html test.db
Initializing . . done
Database test.db initialized, crawl started with 10 threads
crawl result for: link GET http://localhost/index.html
new request found link GET http://localhost/test1.html
crawl result for: link GET http://localhost/test1.html
new request found link GET http://localhost/test2.html
new request found link GET http://localhost/index.html
crawl result for: link GET http://localhost/test2.html
new request found link GET http://localhost/test1.html
new request found link GET http://localhost/index.html
Crawl finished, 3 pages analyzed in 0 minutes
http://…/index.html
http://…/robots.txt
curl and python requests are able to GET the page, though.
I was trying to crawl a website with -m active -v and I am getting these errors; could you please look into it?
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 62, in run
self.crawl()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 215, in crawl
probe = self.send_probe(request, errors)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
probeArray = self.load_probe_json(jsn)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
return json.loads(jsn)
File "/usr/lib/python2.7/json/init.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 5 column 1 - line 5 column 249 (char 69 - 317)
Exception in thread Thread-5:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 62, in run
self.crawl()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 215, in crawl
probe = self.send_probe(request, errors)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
probeArray = self.load_probe_json(jsn)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
return json.loads(jsn)
File "/usr/lib/python2.7/json/init.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 5 column 1 - line 5 column 249 (char 341 - 589)
Hi Htcap team,
Thanks for the wonderful work (really, your tool is awesome).
I'd just like to ask a quick question:
Do you know how we can work with applications that require user login?
I know precisely which page the login page is, and I've got some credentials to test. But I don't know where in the tool I can tell it to use those credentials for login. By default, the tool is using default credentials, which don't work on my application.
As I believe many other people are also asking this question, it might also be worth adding it to a wiki or some documentation page.
Thanks a lot