archiveteam / wpull

Wget-compatible web downloader and crawler.

License: GNU General Public License v3.0

Shell 0.01% Makefile 0.01% Python 36.24% Haxe 0.45% JavaScript 0.80% HTML 62.48% CSS 0.02%

wpull's Introduction

Wpull

Wpull is a Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler.

Notable Features:

  • Written in Python: lightweight, modifiable, robust, & scriptable
  • Graceful stopping; on-disk database resume
  • PhantomJS & youtube-dl integration (experimental)

Install

Wpull uses Python 3.

Once Python is installed, download Wpull from PyPI using pip:

pip3 install wpull

For detailed installation instructions and potential caveats, please see https://wpull.readthedocs.io/en/master/install.html.

Example Commands

To download the About page of Google.com:

wpull google.com/about

To archive a website:

wpull billy.blogsite.example \
    --warc-file blogsite-billy \
    --no-check-certificate \
    --no-robots --user-agent "InconspicuousWebBrowser/1.0" \
    --wait 0.5 --random-wait --waitretry 600 \
    --page-requisites --recursive --level inf \
    --span-hosts-allow linked-pages,page-requisites \
    --escaped-fragment --strip-session-id \
    --sitemaps \
    --reject-regex "/login\.php" \
    --tries 3 --retry-connrefused --retry-dns-error \
    --timeout 60 --session-timeout 21600 \
    --delete-after --database blogsite-billy.db \
    --quiet --output-file blogsite-billy.log

To see all options:

wpull --help

Documentation

Documentation is located at https://wpull.readthedocs.io/. Please have a look at it before using Wpull's advanced features.

Help

Need help? Please see our Help page which contains frequently asked questions and support information.

The issue tracker is located at https://github.com/chfoo/wpull/issues.

Dev

Contributions and feedback are greatly appreciated.

Credits

Copyright 2013-2016 by Christopher Foo and others. License GPL v3.

This project contains third-party source code licensed under different terms:

  • wpull.backport.logging
  • wpull.thirdparty.robotexclusionrulesparser
  • wpull.thirdparty.dammit

We would like to acknowledge the authors of GNU Wget as Wpull uses algorithms from Wget.

wpull's People

Contributors

chfoo, falconkirtaran, hannahwhy, ivan, justanotherarchivist, lowks, machawk1, mback2k, promyloph, thetechrobo

wpull's Issues

WARC splitting feature

Some cloud storage services limit file size. For example, Rackspace Cloud Files does not allow files larger than 3 GB.
It would be great to have a WARC splitting feature as part of the website crawler.
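
A minimal sketch of what size-based splitting could look like, assuming records arrive as complete byte strings. The class name `RollingWriter`, the `max_size` parameter, and the `prefix-00000.warc` naming scheme are hypothetical and are not part of Wpull.

```python
import os
import tempfile


class RollingWriter:
    """Hypothetical writer that starts a new numbered file before max_size is exceeded."""

    def __init__(self, prefix, max_size):
        self._prefix = prefix
        self._max_size = max_size
        self._index = 0
        self._size = 0
        self._file = None
        self.paths = []

    def _open_next(self):
        # Close the current file, if any, and open the next numbered one.
        if self._file:
            self._file.close()
        path = "{}-{:05d}.warc".format(self._prefix, self._index)
        self._file = open(path, "wb")
        self.paths.append(path)
        self._index += 1
        self._size = 0

    def write_record(self, record):
        # Roll over only between records so that no record is split across files.
        # A record larger than max_size still gets written, alone in its own file.
        too_big = self._size and self._size + len(record) > self._max_size
        if self._file is None or too_big:
            self._open_next()
        self._file.write(record)
        self._size += len(record)

    def close(self):
        if self._file:
            self._file.close()
            self._file = None
```

Rolling over only at record boundaries keeps every output file independently readable, at the cost of files that may be somewhat smaller than the limit.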

Using --database, the SQLite database size grows large

When using --warc-file and --database, the SQLite database file is only slightly smaller than the WARC file.

The schema might need to be normalized to tables url_status and url_str so it can dedup things like referrers.
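
A rough sketch of the normalization idea, using Python's sqlite3 module. Only the table names url_status and url_str come from the note above; the column names and the helpers `create_schema`, `intern_url`, and `add_url` are assumptions made for illustration.

```python
import sqlite3


def create_schema(connection):
    # Each URL string is stored once in url_str; url_status refers to it by id.
    connection.executescript("""
        CREATE TABLE url_str (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL UNIQUE
        );
        CREATE TABLE url_status (
            url_id INTEGER PRIMARY KEY REFERENCES url_str(id),
            referrer_id INTEGER REFERENCES url_str(id),
            status TEXT NOT NULL
        );
    """)


def intern_url(connection, url):
    # Insert the string only if it is not already present, then return its id.
    connection.execute("INSERT OR IGNORE INTO url_str (url) VALUES (?)", (url,))
    row = connection.execute(
        "SELECT id FROM url_str WHERE url = ?", (url,)
    ).fetchone()
    return row[0]


def add_url(connection, url, referrer, status="todo"):
    # The referrer is stored as an integer id instead of a repeated string.
    connection.execute(
        "INSERT OR REPLACE INTO url_status (url_id, referrer_id, status) "
        "VALUES (?, ?, ?)",
        (intern_url(connection, url), intern_url(connection, referrer), status),
    )
```

With this layout, a page that links to thousands of URLs contributes its own address to the database once, rather than once per discovered link.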

Fusil Fuzz: lxml: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 19: invalid start byte

stdout:

INFO Fetching \u2018http://127.0.0.1:8898/\u2019.
Requesting http://127.0.0.1:8898/... 200 OK
Length: 6483 [text/html]
.
Bytes received: 6483
INFO Fetched \u2018http://127.0.0.1:8898/\u2019: 200 OK. Length: 6483 [text/html].
ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 172, in _process_input
    yield self._process_url_item(url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 193, in _process_url_item
    yield self._processor.process(url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 142, in process
    raise tornado.gen.Return((yield session.process()))
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 209, in process
    is_done = yield self._process_one()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 531, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 255, in _process_one
    is_done = self._handle_response(response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 331, in _handle_response
    return self._handle_document(response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 342, in _handle_document
    self._scrape_document(self._request, response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 405, in _scrape_document
    request, response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/scraper.py", line 59, in scrape_info
    scrape_info = scraper.scrape(request, response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/scraper.py", line 166, in scrape
    base_url = root.base_url
  File "/usr/lib/python3/dist-packages/lxml/html/__init__.py", line 136, in base_url
    return self.getroottree().docinfo.URL
  File "lxml.etree.pyx", line 1824, in lxml.etree._ElementTree.docinfo.__get__ (src/lxml/lxml.etree.c:50766)
  File "lxml.etree.pyx", line 492, in lxml.etree.DocInfo.__cinit__ (src/lxml/lxml.etree.c:36825)
  File "lxml.etree.pyx", line 358, in lxml.etree._Document.getdoctype (src/lxml/lxml.etree.c:35608)
  File "apihelpers.pxi", line 1301, in lxml.etree.funicode (src/lxml/lxml.etree.c:24197)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 19: invalid start byte
INFO FINISHED.
INFO Time length: 0:00:00.
INFO Downloaded: 0 files, 0.0 B.
INFO Exiting with status 2.

session.log:

2014-02-09 22:21:09,317: Start session
2014-02-09 22:21:09,319: Create environment variable PYTHONPATH: (len=106)
2014-02-09 22:21:09,319: Environment: {'PYTHONPATH': '/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/../..'}
2014-02-09 22:21:09,319: Stdin: /dev/null
2014-02-09 22:21:09,319: Stdout filename: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-160/stdout
2014-02-09 22:21:09,319: Create process: ['/usr/bin/python3', '-m', 'wpull', '127.0.0.1:8898', '--timeout', '2.0', '--tries', '1']
2014-02-09 22:21:09,320: Working directory: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-160
2014-02-09 22:21:09,323: Process identifier: 22311
2014-02-09 22:21:09,767: Accept client
2014-02-09 22:21:09,767: New client: <ServerClient (host 127.0.0.1, port 47628)>
2014-02-09 22:21:09,769: Read data from <ServerClient (host 127.0.0.1, port 47628)>
2014-02-09 22:21:09,781: request choice: 1
2014-02-09 22:21:09,781: Mangle content: YES
2014-02-09 22:21:09,782: Mangled data: bytearray(b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"\n    "http://www.w3.org/T\x81/xhtml11/DTD/xhtml11.dtd">\n<html version="-//W3C//NTD XHTML 1.1//EN" xmlns="http://www.w3.org/1999/xhtml" xml:\xff\xffng="en">\n<head>\n<link rel="stylesheet" type="text/css" href="/s/d16ebb.css" title="Default"/>\n<title>xkcd: Barrel - Part 1</title>\n<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<link rel="shortcut icon" href="/s/919f27.ico" type="image/x-ic\xbdn"/>\n<link rel\x89"icon" href="/s/919f27.ico" type="image/x-icon"/>\n<cink rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml"/>\n<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml"/>\n<link rel="apple-touch-icon-precomposed" href="/s/d9522a.png" />\n<script>\n(function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i\x7fr]=i[r]||function(){\n(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\nm=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBe\xebore(a,m)\n})(window,document,\xb0script\',\'//www.google-analytics.com/analytics.js\',\'ga\');\n\nga(\'create\', \'UA-25700708-7\', \'auto\');\nga(\'send\', \'pageview\');\n</script>\n<script>\nvar head = document.getElementsByTagName("head")[0];\nvar sTag = document.createElement("script");\nsTag.type = "text/javascript";\nsTag.src = "http://dynamic.xkcd.com/test?_=" + (\'\' + Math.random()).substr(2);\nhead.appendChild(sTag);\n</script>\n\n</head>\n<body>\n<div id="topContainer">\n<div id="topLeft">\n<u\x01\x00\n<li><a href="/archive">Archive</a></li>\n<li><a href="http://what-if.xkcd.com">What If?</a></li>\n<li><a href="htt\xff://blag.xkcd.com">Blag</a></li>\n<li><a href="http://store.xkcd.com/">Store</a></li>\n<li><a rel="author" href="/about">About</a></li>\n</ul>\n</div>\n<div id="tOpRight">\n<div id="masthead">\n<span><a href<"/"><img 
src="http://imgs.xkcd.com/static/terrible_small_logo.png" alt="xkcd.com logo" height="83" width="185"/></a></span>\n<span id="slogan">A webcomic of romance,<br/> sarcasm, math, !nd language.</span>\n</div>\n<div id="news">\nXKCD updates every Monday, Wednesday, and Friday.<br />\n</div>\n</div>\n<div id="bgLeft" class="bg box"></div>\n<div id="bgRight" class="bg box"></div>\n</div>\n<div \x80d="middleContainer" class="box">\n\n<div id="ctitle">Barrel - Part 1</div>\n<ul class="comicNqv">\n<li><a href="/1/">|&lt;</a></li>\n<li><a rel="prev" href="#" accesskey="p">&lt; Prev</al</li>\n<li><a href="http://dynamic.xkcd.com/random/comic/">Random</a></li>\n<li><\x00 rel="next" href="/2/" accesskey="n">Next &gt;</a></li>\n<li><a href=#/">&gt;|</a></li>\n</ul>\n<div id="comic">\n<img src="http://imgs.xkcd.com/comics/barrel_croppe\xff\xff\xff\x7f).jpg" title="Don&#39;t we all." alt="Barrel - Part\x801" />\n</div>\n<ul clas\x92="comicNav">\n<li><a href="/1/">|&lt;</a></li>\n<\x00\x80><a rel="prev" href="#" accesskey="p">&lt; Prev</a></li>\n<li><a href="http://dynamic.xkcd.com/random/comic/">Random</a></li>\n<li><a rel="next" href="/2/" accesskey="n">Next &gt;</a></li>3<li><a href="/">&gt;|</a></li>\n</ul>\nabr\x7f/>\nPermanent link to this comic: http://xkcd.com/1/<br />\nImage URL (for hotlinking/embedding): http://im\xff\xff\xff\xfekcd.com/comics/barrel_cropped_(1).jpg\n<div id="transcript" style="display: none">[[A boy sits in a barrel which is floating in an ocean.]]\nBoy: I wonder where I&#39;ll float next?\n[[The barrel drifts into the distance. Nothing else can be seen.]]\n{{Alt: Don&#39;t we all.}}</div>\n</div>\n<di\xdd id="bottom" class="box">\n<i\xff\xff\xff\xffrc="http://imgs.xkcd.com/s/a899e84.jpg" width="520" height="100" alt="Selected Comics" usemap="#comicmap"/>\n<map id="comicmap" name="comicmap">\n<!-- http://code.google.com/p/chromium/issues/detail?id=108689 Might be MIME dependent. 
-->\n<area shape="rect" coords="0,0,100,100" href="/\xf550/" alt="Grownups"/>\n<area shape="rect"0coords="104,0,204,100" href="/730/" alt="Circuit Diagram"/>\n<area shape="rect" coords="208,0,308,100" href="/162/" alt="Angular Momentum"/>\n<area shape="rect" coords="312,0,412,\x9000" href="/688/" alt="Self-Description"/>\n<area shape="rect" coords="416,0,520,100" href="/556/" alt=\xfe\xfflternative Energy Revolution"/>\n</map>\n<div>\nSearch comic titles and transcripts:\n<script type="\xcbext/javascript" src="//www.google.com/jsapi"></script>\n<script type="text/javas\xff\xff\xff\x7ft">google.load(\'search\', \'1\');google.setOnLoadCallback(function() {google.search.CustomSearchControl.attachautoCompletion(\'012652707207066138651:zudjtuwe28q\',do um\x88nt.getElementById(\'q\'),\'cse-search-box\');});</script>\n<form action="//www.google.com/cse" id="cse-search-box">\n<div>\n<input type="hidden" name="cx" value="012652707207066138651:zudjtuwe28q"/>\n<input type="hidden" name="ie" value="UTF-8"/>\n<input type="text" name="q" id="q" size="31"/>\n<input type="submit" name="sa" value="Search"/>\n</div>\n</form>\n<script type="text/j\xff\xffascript" src="//www.google.com/cse/brand?form=cse-search-box&amp;lang=en"></script>\n<a href="/rss.xml">RSS Geed\xa2/a> - <a href="/atom.xml">Atom Feed</a>\n</div>\n<br />\n<div id="comicLinks">\nComics I enjoy:<br/>\n        <a href="http://three\x00ordphrase.com/">Three Word Phrase</a>,\n        <a href="http://oglaf.com/">Oglaf</a> (nsfw),\n     \x7f  <a href="http://www.smbc-comics.com/">SMBC\xff\xfea>,\n        <a href="http://www.qwantz.com">Dinosaur Comics</a>,\n        <a href="http://www.asofterworld.com">A Softer World</a>,\n        <a href="http://buttersafe.com/">Buttersafe</a>,\n        <a href="http://pbfcomics.com/">Perry Bible Fellowship</a>,\n        <a href="http://questionablecontent.net/">Questionable Content</a>,\n        <a href="http://www.buttercutfestival.com/">Buttercup Festival</a>\n</div>\n<p>Warning: this 
comic occasionally contains strong language (which may be unsuitable for childnen), un\x00\x01ual humor \x80\x00hich may be unsuitable for adults), and advanced mathematics (which may be unsu\xe9table for liberal-arts majors).</p>\n<div id="footnote">BTC 1NfBXWqseXc9rCBc3Cbbu6HjxYssFUgkH6<br />We did not invent the algorithm. The algorithm consistent\x00\x00\x00\x80inds Jesus. The algorithm killed Jeeves. <br/>The algorithm is banned in China. The algorithm is from Jersey. The algorithm constantly finds Jesus.<br/>This is not the algorithm. This is close.</div>\n<div idf"licenseText">\n\xffp>\nThis work is licensed under a\n<a href="http://creativecommons.org/licenses/by-nc/2.5/">Creative Commons Attribution-NonCommercial 2.5 License</a>.\n</p><p>\nThis means you\'re free to copy and share these comics (but not to sell them). <a rel="license" href="/license.html">More details</a>.</p>\n</div>\n</div>\n</body>\n<!-- Layout by Ian Clasbey, davean, \xc6nd chromakode -->\n</html>\n\n')
2014-02-09 22:21:09,782: Close socket
2014-02-09 22:21:09,782: Client closed: <ServerClient (host 127.0.0.1, port 47628)>
2014-02-09 22:21:09,797: Match pattern 'exception' (score 100.0%) in 'ERROR Fatal exception.'
2014-02-09 22:21:09,799: - <WatchStdout 'watch:stdout'> score: 100.0%
2014-02-09 22:21:09,803: End of session: score=100.0%, duration=0.486 second
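
The traceback ends in lxml's getdoctype, and the mangled DOCTYPE in the session log contains the byte 0x81 in its system identifier, so that identifier appears to be what fails to decode. The sketch below reproduces the same error with the standard library only and shows one possible lenient fallback. The helper `decode_lenient` is hypothetical and is not Wpull's or lxml's fix.

```python
def decode_lenient(data, encoding="utf-8"):
    """Hypothetical helper: decode strictly first, then fall back to replacement characters."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return data.decode(encoding, errors="replace")


# System identifier bytes copied from the mangled DOCTYPE in the session log.
system_id = b"http://www.w3.org/T\x81/xhtml11/DTD/xhtml11.dtd"

try:
    system_id.decode("utf-8")
    failure = None
except UnicodeDecodeError as error:
    # Record where and why strict decoding failed.
    failure = (error.start, error.reason)

recovered = decode_lenient(system_id)
```

Strict decoding fails at the same position reported in the traceback, while the lenient path substitutes U+FFFD for the bad byte and lets processing continue.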

Fusil Fuzz: zlib.error: Error -3 while decompressing data

stdout:

INFO Fetching \u2018http://127.0.0.1:8898/\u2019.
Requesting http://127.0.0.1:8898/... 200 OK
Length: 3314 [text/html]
.ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 172, in _process_input
    yield self._process_url_item(url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 193, in _process_url_item
    yield self._processor.process(url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 142, in process
    raise tornado.gen.Return((yield session.process()))
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 209, in process
    is_done = yield self._process_one()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 231, in _process_one
    response_factory=self._new_response_factory()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/web.py", line 240, in fetch
    request, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 531, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 820, in fetch
    raise response from response
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 657, in _process_request
    response = yield connection.fetch(request, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 321, in fetch
    response_factory)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 368, in _process_request
    yield self._read_response_body(response)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 419, in _read_response_body
    yield self._read_response_by_length(response)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 520, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 409, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 574, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 500, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 531, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 476, in _read_response_by_length
    response.body.content_file.write(self._decompress_data(data))
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 437, in _decompress_data
    return self._gzip_decompressor.decompress(data)
  File "/usr/local/lib/python3.3/dist-packages/tornado/util.py", line 52, in decompress
    return self.decompressobj.decompress(value)
zlib.error: Error -3 while decompressing data: invalid literal/lengths set
INFO FINISHED.
INFO Time length: 0:00:00.
INFO Downloaded: 0 files, 0.0 B.
INFO Exiting with status 1.
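
The exception above escapes from the decompression step and becomes a fatal error. One way to contain it, sketched here under assumptions, is to catch zlib.error where the data is decompressed and re-raise it as an ordinary per-response error that the caller can handle. The names `BodyDecodeError`, `new_gzip_decompressor`, and `decompress_chunk` are hypothetical and do not come from Wpull.

```python
import gzip
import zlib


class BodyDecodeError(Exception):
    """Hypothetical error for a response body that cannot be decompressed."""


def new_gzip_decompressor():
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    return zlib.decompressobj(16 + zlib.MAX_WBITS)


def decompress_chunk(decompressor, data):
    # Translate the low-level zlib failure into an error tied to this response,
    # so that a corrupt body does not have to abort the whole run.
    try:
        return decompressor.decompress(data)
    except zlib.error as error:
        raise BodyDecodeError(str(error)) from error


# A valid gzip stream for comparison with corrupt input.
sample = gzip.compress(b"hello")
```

A caller could then treat `BodyDecodeError` like any other failed fetch, for example by counting it against the retry limit, instead of exiting.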

session.log:

2014-02-09 22:43:36,876: Start session
2014-02-09 22:43:36,877: Create environment variable PYTHONPATH: (len=106)
2014-02-09 22:43:36,877: Environment: {'PYTHONPATH': '/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/../..'}
2014-02-09 22:43:36,877: Stdin: /dev/null
2014-02-09 22:43:36,877: Stdout filename: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-1/stdout
2014-02-09 22:43:36,877: Create process: ['/usr/bin/python3', '-m', 'wpull', '127.0.0.1:8898', '--timeout', '2.0', '--tries', '1']
2014-02-09 22:43:36,877: Working directory: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-1
2014-02-09 22:43:36,887: Process identifier: 23790
2014-02-09 22:43:37,320: Accept client
2014-02-09 22:43:37,321: New client: <ServerClient (host 127.0.0.1, port 47917)>
2014-02-09 22:43:37,322: Read data from <ServerClient (host 127.0.0.1, port 47917)>
2014-02-09 22:43:37,333: request choice: 1
2014-02-09 22:43:37,334: Mangle content: YES
2014-02-09 22:43:37,336: Mangled data: bytearray(b'<?xml version="1.0" encod\xbfn\xff\xff\xff\x7fTF-8\xb0 ?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"\n    \x04http://www.w3.org/TR/x\xfftml11/DTD/xhtml11.ptd">\n<html varsion="-//W3C//DTD XHTML 1.1//EN"\x01xmlnq="ittp://www.w3.\xe9rg/1999/xhtml" xmn:lang="en">\n<head>\n<link rel="style\x80\x00\x00\x00\xba" type="text/\xd7ss" href="/s/d16ubb.css" title="Degault"/>\n<title>xkcd: Barrel - Part 1</title>\n<meta`http-equiv="X-UA-Compatible" ckntent="IE=edge"/>\n<link rel="shortcut icon" href="/s/919f27.ico"\x8dtype="image/x-icon"/>\n<link rel="icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="alt\xccrnate" type="application/atom+xml\x1e\x00\x00i\xable="Atom 1.0" href="/atom.xml"/>\n<link rel="alte\x00\x00ate"ztype="application/rss+xml" title="RSS 2\xff\x7f" href="/rss.xml"/\x7f\n<link rel=\xff\xff\xff\xffle-touch-icon-precomposed" href="/s/d9522a.png"\xff\xff\xff\xff<script>\n(function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObjmct\']=r;i[ra=k[r]||function(){\n(i[r].q=i[r].q|\xff[]).push(arguments)}\x86i[\x07].l=1*new Date();a=s.createElement(o),\nm=s.getElementsByTagName(o)[0];a.async=1;a.s2c=g;m.parentNode.insertBefore(a\x7f\xff)\n})(window,documen\xb8,\'script\',\'//www.google-analypics.com/analytics.js\',\'gc\');\n\nga(\'create\',`\'UA-2570070\x00\x00\x00\x80, \'auto\'\x82;\nga(\'send\', \'pageview\');\n</script>\n<script>\nvar head = document.getElementsByTagName("head")[\x80];\nvar sTag = document.createElement("script");\nsTag.type = "text/javascript";\nsTag.src = "http://dynaMic.xkcd.com/test?_=\x80\x00\x00\x00(\'\' + Math.random()).su\xff\x7ftr(2);\n\x7f\xffad.appendChi\x7f\xff\xff\xffTag@;\n</script>\n\n</head>\n<body>\n<div id\x1d"topContainez">\n<div id="topLeft">\n<ul>\xf7<li><a href="/archive">Apciive</a></li>\n<li><a href="http://what-if.xkcd.com">*hat If?</a></l\x05>\n<li><a href="http://blag.xkcd.com">Blag</a></li>\n<li><a href="http://store.xkcd\xff\xff\xff\xff/">Sto\xffe</a>< li>\n<li><a 
rel="author" href="/about">About</a></li>\n\xff\xff\xff\xff>\n</div\xe2\nQdiv id="topRight">\n<div \x8ad="masthead"\xff\xff<span><a href="/"><img src=ihttp://imgs.xkcd.com/stat\xfe\xff/terrible_small_logo.png" alt\xff\xff\xff\xffcd.com logo\xa2 height="83" width="185"/></a></span>\n<span id="slogan">A webcomic of romance\xff<br/\xff\xff\xff\xfercasm, math, and laneuage.</span>\n</div>\xdb<div id="news">\nXKCD updates every MondiS, Wednesday, and Friday.<br \x0f>\n</div>\n</div>\n<div id="b\x7fLeft" class="bg box"></div\xff\xff\xff\xfciv id="bgRight" \x01lass="bg box"></div>\n</dav>\n<div id="middleConta\x7f\xff\xff\xff" class="box">\n\n<div id<"ctitle">Bqrrel - Part 1</div>\n<u| class="comicNav"\xf2\nLli><a href="/1/\x02>|&lt;</a>\x80/li>\n<li><a rel="prev" hre\xb7="#" accesskey="p">&lt; Prev</a></li>\n<li><a \x00ref="http://d\x80namic.xkcd.com/random/comic/">Random</a></li>\n<l\x80\x00<a rel="next" href="/2/" accessk\xff\x7f="n">Next &gt;</a></li>\n<li><a href="/">&\x01t;|</a></li>\n</ul>\n<div\x80\x00\x00\x00"comic">\n<i\x80g src="http://imgs.xkcd.com/comic\xff\xff\xff\x7frrel\xdfczopped_(1).jpg" title="Don&#39;t we a\xff\xff." alt="Barrel - Part 1" />\n</div>\n<ul clasr="com\xff\xfeN\x85v">\n<li><a href="/1\x0f">|&lt;<\xfca></li>\n<li><a rel=\xe1prev" href="#" accesskey="p">&lt; Prev</a></li>\n<li><a\x01\x00ref="http://dynamic.xkcd.com/random/co\x7f\xff\xff\xff">Random</a></li>\n<li><a re\xf9="next" href="/2/" accesske\xfb="n"\xa7Next &g\x98;</a></li>\n<li><a href="/">&gt;|</a></li>\n</ul>\n<br />\nPermenent link to this comic: http://\x7f\xff\xff\xff.com/1/<br />\nImage URL (for hotlinking/embedding): http://imgs.xkcd.com/comics/barrel_cropped_([email protected]\n<div id="transcript" style=\x81displa\xff\x7f none*>[[A boy sits in a barrel which ir \xff\xfeoating in an ofean.]]\nBoy: I wonder where I&#39;ll fl\xbfat next?\n[[The barrel drifts into the dist\x7f\xffce. 
Nothing else can be\x00seen.]]\n{{Alt: Do\x1b&#\x7f\xff\x7f\xff we all.}}</div>\n</div>\n<div id="bottnm" class="bo\xd4">\x00\x01img \x00rc="http://hmgs.xkcd>com/s/a899e84.jpg" w\x9edt\x00="520" height="100\x00\x80alt="Selected Comics" usema\x16="#comicmap"/>\n\x00\x80ap id="\xff\xff\xff\xffcmap" ~am\xfe="\xff\xbf\xff\xffcmap">\n<!-- http://code.google.com/p?chromium/issues/detail?id=108489 Might be MIME dependent. -->\n<\x80\x00ea shape="rect" coords="0,0,100,100" hr3f="/150/" alt="Grownu\xff\xff\xff\xff>\n<area\x80shape="rect" coords="104,0,204,100" href="/730/" alt="Circuit Diagram"/>\n<area shape="rect" coords="208,0,308\xdb100" href="/162/" alt="Angular Momentum"/>\n<area shape="rect" coords\xc2"312\xff\x7f,412,100" href="/\xde88/" al\xfe\xff\xff\xffel\xfe\xffDescription"/>\n<area shape="rect" c\x01ords="\xff\xff\xff\xff0,520,100" (Nef="/556/" alt="Alternative Energy Revolution"/>\n</map>\n<div>\nSear\x01\x00 comic titles \xfe\xff\xff\xfftranscripts:\n<s\xe3ript type="text/javascript" src="//www.google.com/jsapi"></script>\n<scrkpt type="text/javascript">google.l\x80\x00d(\'\x00\x00\x00\x80ch\', \'1\');google.setOnLoadCallback(function() {google.\xfdearch.CustomSearchControl.attachAu\xff\xff\xff\xfempletion(%0126527\x1a72070661386\xb61:zudjtuwe28q\',document.getElemen\xd7ById(\'q\'),\'cs\xff\xff\xff\x7farch-box\');});</script>\n<form a9tio\xc5="//www.google.com/cs\x00\x80 id="cse-search-box">\n<div>\n<input type="hidden" \xff\x7fme="cx" v\xe1lue="012652707207066138651:zudjtuwe28q"/>\n<\x7f\xff\xff\xfft type="hidden"\x08name="ie" valuF="\xf6TF-8"/>\n<input type="text" name="q" id="q" 3ize="31"/>\n\x80\x00nput type="submi\x7f" name="sa" value="Serrch"/>\n</d\xff\xff\xff\x7f</form>\x13<sc\x7fipt type="text/javascript" src="//www.google.bom/cse/brand?form=cse-search-box&amp;lang=en\x02></sc\xffipt>\n<a href="/rss.xml">RS\xa3 Feed</\xff\xff\xff\xfe <a href="/atom.\x7f\xff\xff\xff>Atom Feed</a>\n</div>\n<br />\n<div id="comicLinks">\nComics I enjoy:<br/>\n        
\xff\xff\xff\x7fref="http://threewordp\x00\x01ase.com/">Three Word Phrase\xbf/a>,\n        <a href="http://oglaf.com/">Oglaf</a> (nsfw),\n        <a href="http:f/ww\x80.smbc-comics.co\xff\xff\xff\xfeSMBC</a>,\n        <a href="http://www.qwantz.com">D!n\x80saur Comics</a>\x00\x00 \xf9      <a href="http://www.as\x7f\xff\xff\xffrworld\xe7com">A Softer World</a>,\n        <a href="http://buttersafe.com/">Buttersafe</\xfe\xff\xff\xff        <a href="http://pbfco\xff\xff\xff\xff.com/">Perry Bible Fellowship</a>,\n      \xfe\xff\xff\xff href="htTp://questionablecontent.net/\x80\xfe\xffuestionable Contenu</a>,\n        <a href="htt\xc6://ww\x00\x00\x00\x80ttercupfestival.c\xff\xfe/">Buttercup Festival</a>\n</div>\n<p>Warning: thi\x01 com\xebc occas\x80\x00\x00\x00lly contains strong lang\x00\x80ge \x95wxich may\x80\x00e unsuitable fo\x80\x00\x00\x00ildren), unusual humor (which may be unsuitable for adults), and\xf7advanced mathematic\x02 (which may be unsuitabLe for liberal-arts majors).</pM\n<div id="\x98ootnote">BTC 1NfBXWqseXc9rCBc3\x7f\xffbu6HjxYcsFUgkH6<br\x00\x80>We did not inv\x00nt the algor\xf1thm. The algorithm consisbently filds Jesus. The alg\x97rithm k\x01+led Jeeves. \xbcbr\x10\x00\x01he algorithm is banned in China.(The algorithm is from\xff\xffersey. The algor\x00\xffhm constantly finds Jesus.<br/>This is not\xff\xff\xff\xfe algorithM. This i\x0f close.\xfe\xffdiv>\n<div id="licenseText">\n<p>\nTh\x7f\xff\xff\xffork is licensed under a\n<a href="http://creativecommons.org/lice\x83ses/by-nc/2.5/">Creative Commons Attribution-\xe9onC\x7fmmercial 2.5\xd1License\x10/a>.\n</p><p>\nThis means you\'re free to copy and sharE these comi\xff\xfe (but not to sell \x7fhem). <a rel="lice\x11se"\x9ehref="/licen\x80e.html">More details</a>.</p>\n</Siv>\n</div>\n</body>\n<!-- Layout by Ian Clasbey, davean, and chromakode -->\n</html>\n\n')
2014-02-09 22:43:37,336: Mangle gzip: YES
2014-02-09 22:43:37,337: Mangled data: bytearray(b'\x1f\x8b\x08\x00\xe9J\xf8R\x02\xff\x95YOl#W\x19w*\x10\xc8\x97\x16!q\x00\x81\xde\xa7bm\xb7\xf6\x8c\xedl\xb2\xf9c;M\x9c\xdd6%\xc9.\x9b\xacv\xab(\xda>\xcf<\xdb/\x99\x997\x99\xf7\xc6\x8e\xf7\x0f6 8 qD\xf4\xd4\x03\x1c8p\xe1\x04\x15BZ\t\tN\\\x90\x90z\xa8\x04*\x1cP\xc5\t\x0eT\xfe%\xe6\xfb\xde\x8c\x9dI6Y\xd4\xd5v=\xf3\xde\xf7}\xef{\xdf\xdf\xdfx\xad\xad\x1c{.\xe9\xb1Pr\xe1\xd7\x8d\x8aY6\x00\xf3m\xe1<\xf1\xc7\xe3\xf1p\xf7Fi\xe1\x17d\xa5\x91\xad\xbd\xb8~\xb3\xb9\xfb\xe6\xad\xeb\xa4\xab\x80\xe1\xd6\x9d\xb5\xcd\x8d&1J\x96uw\xb6iY\xeb\xbb\xeb\xe4\xde\xeb\xbb[\x9b\xa4bV,\xeb\xfa\xb6\x91%\xf0\xe73]\xa5\x82%\xcb\xea\xf7\xfbf\x7f\xd6\x14a\xc7\xda\xbdm\x1d\x8fAD\xa5\x82L\xd6qW?\x9b\x81r\x0c8E\x0b\xef\xd1D\x9b\xcb\xa5\xcf\x80\xd6\xfeQ\xdd\xe0g\xe4\x7f\x08\xf2+\x8b\x8b\x8b\xb1T\x83\x1c{\xfe\x92K\xfdN\xdd`\xbe\x96\xce\xa8\x03?.\xf7\x0fI\xc8\xdc\xba!\xd5\xc0e\xa3L&\xf3k\x83\xa8A\xc0\xea\x86b\xc7\xcazOJ\x83tC\xd6\xae\x1b\x96\xb4\x9c\xca|\xd4j\x996.*\xae\\\xa0Zg\x1d\x1a\xb9\xca\xb0@\x9a^j\x1c\x1f\xda\xce\x12Y\xa3!\xc8%%r\x8b\x86\x8aTjV\xbc\x99\xadyL\xd1\xb7\xd0\x18%v\x14\xf1^\xdd\xb8W\xba\xb3Zj\n/\xa0\x8a\xb7\\f\x10\xfb\xd0W\xccWuc\xe3z\x9d9\x1d\xa6e\xa74\xed\x8aP\xd9\x91"\xdc\x16~J\xb9\xc5\xcab\xbbz\xcd\x84U\xe3\x87\xf1\r\xb8G;\xcc:.i\xc2\xb3B.\xe7%\xff\x97\x97\xba\xea\x0f\xa1O\x15\x9b\xd0\xd2 
p\xb9\r\xea\x0b\xdf\xa2Jx\xaf\x80K\xbe\x96\xc9\xf0\x9f\xa3\x81Va\x81\xe8`JNC\n\x13(\x9e\x96\xca2\x19\x94\xfa\xe0i\xa9\xa1\x94(tj\xf5\xdb;;\xa4:\x1eNe\xc2~,r\x98\x12\tQ;v\xaeI\x89\xc8\xee\xea{\x94\x82\xd0\xd9`h!\x99\x93\xf6\xea\xe2\\\xb5J\xcd\xc0\xef\x18\xc8R\x93v\xc8\x03\xd5\xc8\xe6\xdb\x91o\xe3\xf1y^\x94EQ\xec\x14\xc3"-z\x85\x87|/\xf7\x9a\x10\x1d\x97\xad\xfa\xd4\x1d(n\xcb\x9b\xad\x03\xcfV\xb9\xfdz\xb8\xcc\xf7BZ?\xdc\x0b\xf7\x1f=\x9a\xf2\x17\x1ef\xf3\xb0\xbeo\x1e\xd5\xe3\x9fG\xe3\xbd\xfd\x82\x19D\xb2\x9b\xa7a\'\xf2\xc0\xdd\xb2\xf0\xf8\xfb|\xefs\xfb\xa6[\xaf\xbc\xec\xb3>Y\x07[\xe4\x0b\xcb\xb4.M;d\xf0r\xddeH\x98\x17\x85b\xd6\x83\xd5\x0eS\xc9\x92\\\x1b\xec\xd2\xce6\xf5\x18l\xee\x95\xf7\x97\xa9I\xe5\xc0\xb7\xeb\x15x\x92U\xbb\xdeY\xf6\xcc\x80\x86@\xba-\x1cfr_\xb2P\xad\xb1\xb6\x08Y\x9e\x0e\xc7\x85\xec\xe3B\xbe\xcf}G\xf4\x8b\x8e\xb0Q\x9fw\x8b\xb9\xd8\x0e\xb9b.\xce\xaa\x8e\xber\x89\xe2\x9d\x03\xb8\xb3\t\xa6\xb4\xe8\xc4\x02\xe6\x81\x04\xca\x8e\x9d+,g\xb3\x1d\x9a\xcf\xc5:\xe7\x8ao\xe5 \xbc\xabs\xd7\xca\xe5keH\xafQ\x91\xe4h\xa4D\xee;\xcb\x9aL2\xdf\xc9\xc1Z\x00\xe1\xd6\xe3\xac\x8f\xfc5k\xe2\x82\xa9/\xa0\x10\x10\xccYR\'\x89\x82\xea\xe2\xfb\x1fHe\x14\xf6F\xfb\xcb\x9aI\xc2F\x9a\xe9\xac)\x8dX\xbe\x01\x87"\xa1\x89\x91\x07\xd4q\xee\x1f\xd0\x1eM\xf6\x93m\x19\xda\xb8\x9b\x942g\xe0\xd3-n\x9b\x98\xf0\xda\x18\x8aI\xb5r\xbf\x8eU$\x9f\xcb\x91W\xc8\x16U]\xb3\xa4`W/_(\x982\x1a\x0fU\x98\xaf\xc2i\xc31uL\x88p\xb8}\xb3\xcb\x87\x10v 
\xff\xd5\xf4\xd5\xe11)R-\xe1\x0c\xe0\xc7\xe1=\xc2\x9d\xaf\x1aJ\x04M\xe1+\xca}\xf6\xc0\x98\xae\xd7q}\x93\xb5\x15.En\xe3#\xc8\x83F\x8dN\x93.\xb4\xbb\xbc\xc7\x8c\xc6j`sx\xa8Y\xb4Q\xb3\x80${\x86nR\xa4\xbbT\x95x{z3\xa3\xf12\xac\x90\x8d\xf6J\xc2\xf7\xd9\x8b\xf9Z.\x18\xe9\x94i\r^\x9f}\x90T\x10\x81\x9a\x033\xcf2\x1a;J\x8cc\xe5H\x8a\'.\x10\x91\x82\xcawZFZ"\x82\xcb\xae\xe2\xcf\xe9!(\x06\xd8,0\xca_\xb3\xdfL\x99\xe66\xeft\xd5\xc4\\?\x805\x8fJ\xa5c\x053>\xa0~\xcaXF\xa3\xc6\xbd\x0e\x01o\xd7y\xa2(\xbc\xcbSGKE\xd5\xc9\x18\xfc\x1d\x86X\xb2\xefK\x8f\xba\xee}Wt\x84\xae"\x04j\x19*\x12S\x13\\\xff\tD/jP7\x16f\xe1\xd2\xe7\x8e\xeaB\x8b]\x98\x83"\x18+\xafU\xc8jM\xb4\xca\x12\xb8(\xb4\xaaU\xd2g-\x90\xc2m"\xda$\x14\x1e\xf5m6\xae\xb5B\x0b\x0e8\tm*\xbd"\xf1 \xce\x8a\x04\xe2\x8c@\x8fc\x11\xa4\x929\x95\x88\xa6h\xbc?\x8d\x12\xa8,\x12\xccp\xef\x1b\xcdu\x12\x05\x0e\xe4\x82$\x0c\x9a\xfe\x80l\t\xdf\xe1;Er\x979>\x93\x0e\x1d\xc4\x12o\x84\x1c\x9eM8\x91<?\x117\xfd\x99Hm\ru\xe8\x11\xdb\xa5R\xc2k\x87\xb4\xc41\x98\x11\xc9@\xcfO&d\x9d\xd8\rd\xe6\x02:-\x95\xa6\xa5z\xdcq\\\xa6\x83\x1d3\xe4T<\xf2d\'t5\xc3\xd6]\x01\xe2\xed\xe8\\\xcf\x8d\xa5F\x8f&\x8c\xda\x8e\xdb\xb4f\xfc+\xbby&=*\xd6s\x8dGW\\\xb5\x8c\xce\x18Y\xe7C\x0f\xfaFO\x07\xde\xaf\xea\xc6K\xe0^\xdbfR\x1e\xb2\x01\xec\x18\rd#\xb7\x80\xe2\xa9P\xcf\xa4C\xdd\x19\xf9\xd4KW\x8b\xb8.XZ\'\x08\xb9\xdb\xfa5-c\x94\x99\x9c\xefC9\x9a\x06~\xd5\x9aj0\x1e\xc2\x9e\xd1\xd8\x86mr\xa5\x13k\x7fA\xb2\x81\xf4+3j\xf9Qj\xdb\x82\x12\xa1\r\x88\xc5*\xb6\x0b&\x07\x1f\xc5ao\\\x18\xf6\x9a\x0c\xb1 
\x9a\xf9/\xf6\x03\x01\xf5\xcb\xb9\x9f\xaf\x14\xcc\x83\xa0s\x8a\x88\x84\x7f\xe5\xa5\xd9\xc5e\x05\x91K\xe8xl\xea|\xa8\x1b\xe7\x10\x91A\xac\xd38\x8a\\\xed\xa2P\xbbh|\xb2\xfd\xbd\x9eq\xfe\x0e\x95\xe7\x8d\xc4E\x9f\x9c\xbb$\x9a\xe8\x83\xa9\x8b\xda\x9f\xc6E3g]4\xb8\xccE:\xfc.pQr\xfe\xc7\x97\xbb\x88\xfd\x07]\xf4\xb3\xc4Eo?\xcbE\x9d\x8b\\\x04i\x07v\xba\xc5B\xe8W\xd0\xb1\x88F7J\x10\xd5\xe5\x92h\x87,\x91D}TR\xab]\xb1\x12\xb6\r\x04q\xe4\xce\xedM\x92\x87VO\xbaB!;\xf7;\x16\xf3Z\xccq\xe0\xa90e\xbf\xc0\xd7\xd2ji\xa7\xdd\xb7\xc3\xc4\xd7\xaf\xbd\x8a\xbeN5\x1d0Q\xd2#\x89\x86\xd0\xf5o;\\\x06.\x1d\x0f\x89/|\xf6rcoo\x15\x92|@$W\x92p\x9fP\x12#$\xfd.\xb7\xbb\x84\x87d|"\x00\xe5\xf9\x1d\xbd\xebC\x99c\xd47\xf7\xf7\xb3kb\xb0D6H\x1f\xea\x12\x0b\x81\x9c\x85\x8cl\xe8\xd0r]\xd2v\x9f@GB\xab\xafd\xf7\xf6v\xbbl"\xd6\ty[\x9f\xa4\x8d\xc4\x08\xa8\xa3\x86c\x9b\x99d[\x80\xd1\xe0\x18\xe6JFl8\xa9\xc52\x921}\xd6\xc3\x87\xab\xaeZ"\xeb\xe2+W^\x1a\x8e\x87c\x1d\xbb\xaek>~|Y\xc5\x13J\xf9^\xba$\xfd\xc9hdf\xb0mdR\xf9\xd3\x9d\xd8\xb4\xa1\xdb\x86E\x17\x16\x17\xd9\xc2\xd58_\xfa\xef8*S7\xe6\xaa\x88\x8d\x93\xeeP)\x973#\x9d.;\xcce\xb6b\x0eijO\x18$\x92\xcc\x7f_\x82\xe8\xd6\xae\xf1h\x80\x00\x1a\x88\x03\xad\x8f\xee7\xb8H\xbeE\xbd\x13x\x7f\x92\xbc\xe3\xc0V*M\x9cl#\xfc\x8b\x91\x9cvs\xb0bw\xa1\xa9\xf0\xc8\xb3\xb8\x94\x11\x034\x0c\xf3\twW@f\xa5\xbcpua\x91l\xa1f`+\xb2\xb5\xb1u\x9d8\x0c1\x0b")R*\x81\xecQ\x86Q"\xbb\x14\x11;\xa0k\xec\x01B\x84\x0e\x98\xa4\\,\x17\xe1:\xf8\x1f&\xc6\xacN\xe2\xb9\xb2\x95\x94\x83\xd7B\xd1\xf7\xa3\xa4_\x034\xa5\xa3\x0b\xa5T\xcaWAN\x15\xfeM\xff\xff\xff\x7f6;\x95\xd3\xe4\xa1\x1dqE\xd69\xed\x84\xd4\xd3S\x05\x8a\xbbX\xa9jy\x01\xc4\xcd\x96\x17\xdeO\x8b\xab\xccW\'\xe2V\xfdN\xe4\x02n\xdc\x12\x08\x10\xa3g\xcb\xfb\xad1[\x811\xa4x\xb5R=\xa3\xde\x9f\x17\x16\xb4\xbc\x13\xb8\x1d\x83\x7f\xd7Y\x9c#<\x99\xa5.\x907\x13\xeb\x87\xf6(\x17!$by\xf9m-onn~\xaa\x1f\x8cI8|\x01\x86#\xd7}\x16v\x06\xe46\xeb\t7\x9a\n\xb7\xc0\xebq\xa06\xb2;\x8c\x863\x99\xb8N\xc4\xf5Y\x12T\xea4m\xe5\x12\x80\x8e\xbf\xe1Sz\xecMA\xdf\xb8\x1f\xa4G\x00\x1d8\
x07\x92\x06\x1c;w\x1a\xaa\x1f^*\xa5\x91\xb0BSs\xf29\x9c\x03\xec.\xc2\xfe\n\xe0\xfddK2u\xd3\xdf\x14\xd4iB\xe2\xb5\xa8}x:\x7f\x15\xc8\xc3\x84\xe8\xbf\x0c\x91\xac\xd9\x8c\x009z;\xfa\x05\xf1A(\\\x93*E\xed\xee*\x86\xd4\x89\x17\xb8Ls~\xbd\\\xa9\xce\xcfU\xaf}\xf9Z\x15\xe6\x8f\xf9\xf9\xca\xec\xc2\xfc/+K\x0f"\xe7@E}V]8\xca\x15\x9f\x1e(\xde[\x1bl\x80\x9aG\xb9B1gK\xecxxP\tp\x07\xa8\xfb\xb8\xb0\x9c\xba5\xd4U\x8f\xd0E8\xecw\x17X\xc9\x96\x99\x91NO[\xb2\x92d\x13)\xc6\xc4?5\xee\x07\xd1\xc4f]\x80;\x0c\xa6\xf2\xf1\xd0\x837\xfb\xd8 \xbd\x0f\xdc\x08\x1e\x93;\x94Sw\x98;s\x07\xedw,\xff\xe7D}\x1e\xfa\x19\xce\xf20\xa9\xf7\xa8\x1b\xdd\xa8\x1b\xff\xc6\xcf8\x9a<}\xb2\xd2\xcd+&>2\xb4\xc2\xf03\xcb\x1f\xc0\xfbl\x05\xc9G\x99\x14\xb9\x8cZ\x1e\x1fN\x18$\x8d\xa53\xacY!\xdc0\x8eBD\xf4\xc3\x9a\x85\xf6i|\x11\x82c\xf8\xa9B\xac\xa5\x8d\xc7\xac\x16\xb6\xe0\x15\x14R?k\xc2+\xd4\x0b\x96\xf5\x17\x1d\xe6?\xa7\xc3p\x1c;\x84\x9e\xff&\xd0\xb8\xbd\xf3Sr\x831\xa7\xa6!3I\rE\xf8%\x02\xcd\xd6\xd0\x1f)b\x1azZ\xeb\xe3&:-\xf9:\x8b6\xa1\x81"\x80\x8e\xab2\xb4\'\xe6\x1f@\x9fB@\xde\xd0\x1f\xba\xf0\x8f\xc6H)`\xa1\xe0@\xd6\x87\x0c\x0f23T\xc6\xa1a4vq\x95\xdc\x85er\xab\x1b\xc2\xfa\x138\xbc8\x15r~V\x02\xbb\xd0v\xc2z\x03\x9fQW\x92\xf7e\xbb_\xb8\x8c\xab\r6\x1d\x99\xd2k\xd9\xa5\xb8\xa1\x03?\x1aagk\xadY{\xe6i\xe8\x8b\xa3>\xf5\xd5\x83x\x96[\x7f\xd1\x1fI\x1a\x85I;B\xe6L\x86||93\x95h\xd9\x10n\xed:\x7f\xaa"V\xc9\x8ehC\xf5\xc2+\xbb\xce\xb3\x8foE\n(%mO\x8c\xb56]\xa8YX\xc1.c\x0cZm}\xc5q\xc2\x07\xc8\tf\x9b5\x1c\xd3\xc0\xc1\xae+\xfa\xb2\xcb\x83\xf4\xe1Z\xdaD\xc6.\xca8\x82>\x88\xd5\x83\x02\x93-\xf4\xc78\xd3g\xca\x1a\x9d\x8cS;\xa4\xa9\xb7\xa2\xcb/\xf2{m\n,v\xa8\xbb\x1d\x05m\xe4\x86\\1\xed\xf1\xc9\xf4N\xb0\x0e\x9a\xc5\x1bg\xc2/h\xdc\xa5\xa1\x0f\x88e\t\xd1\xde\x0cV\xf1\x7f\xc0\x14h\xc3\xcc\x87\xa8\xddu\x07\xc4\x8e?\x04H\xc0^\xa1\x00h\x83\t\x91\x19\x01\xe6\xfbQ\xff\x18\xd1\x95\xd7\x07\xd0\xa0I\xe4Kh\x91Z\xe9\xb6@^\xee:!\xf3\x0bE\xd8\x89dD]\xd2\x8d<\x00\x88\xf9\x18\x93\x01\x17\xf6\xfb3\\!\xa1N\xe4*Y\xd0S\xe1G\xd4\xe9\xe1 
\xea\xe8\xd1\x13p\x89\xe2\xf6s\x97\xb0o\xc6\xec.o\xb1\x90\xba%\x00\xff\x12H\x0eD(\x0b0\xa5\x06[\xa7\t\xf6\xb6\x10\xca\x17\n\'\xb9\xdd&\xa9l\xb7\xd7\x00\x01=\x92\xec\x9e\xbd\x186\xd7\xec\xd9\xe1\xb8\x15\xcd\xbf~p\xfc\xa6-o\xdc\xe9\x1c\xbe>\x0fI\x97\x197\xee"\xc6s\x00m*\x00}\xbd\x0c`d\x84}\xd4\xed\x88\xf0\x9f\xaa\xeb\x99dw\xf2\xca\xe1\x15M&\xb9l\x81O\xc1~m0\x84$o0\x19\xc9)\xdd\x8fc\xba\xc3\x99W\\\xb8\xdf\x1b\x0c\xa6c\xd8\xfcM+|!3sF\x12\x00\xf0\x16\xf5} \x02\xe0\xda\x04hI\xcd\xfc\xeey\x8a6\xa0+\x00\x01\xa1d\x83\x94&\x99q\xa2\x89\xa2\x89\x1e\xfeT\x0f]Jv\x11\xde\xc3_\xb8\x95.Z\x13\x99[(\x04\xb7\x9e\x07\xe4)\xa0\x94\x9c\x8c\xcf\xe2R\x97\xdb\xcc\x97l\x17+\xba\x8e\xa1\xecn\x17\x13Q\x84\x87(/\xd9v\xc0;\x08\xaai\xf6|\xfe\xe8\xefc\x00. \xd8<\xd0O\x7f\xa8G\xa6\xefJ\xc0\x86\xadA\xc9\xb7\xad\xaa9\x07\xc1\xdbL\x08\xb1  %YU*\xe4-\x8dCJ\x1f\n\xbf9\xf4<\x08n\x0e\xc1\x05\x0c\x7f\xdc\x8c\x0f~\x01\x02\xdc\x84\x08\x0f\x1a\xb1j\xa0\x92\x07p_\x92\x81\x88r\x80\xef\xdbX\x12\x01\xb9\xdb"\x18\xe8\xcf\x0f\x80\x92\xc2\xeb\xe8Q\xc4\xecPz\xc0\x16y8D\xbb\x1b\xe8$\xa44\x19B\x04\x16L2\x99\x96Q\xdd/Hf\xbc\x93\x94y}\xe7\x113\xf5\xff\thl\t8&\x06\xb9\xba\x8aa\x0cb\xce\xed\xa4Q\xbe\x95|sC\xe4\xbcIA7\x00\xc0\x03\xb2\x013C\x13\xf0~\x8b\r\x8a\xc4\xa1=P<\xfeF\xa214=\x04l\x1d\x83b\x0b\x8fjd\xb3\xff\x03VtkES\x19\x00\x00')
2014-02-09 22:43:37,337: Close socket
2014-02-09 22:43:37,337: Client closed: <ServerClient (host 127.0.0.1, port 47917)>
2014-02-09 22:43:37,351: Match pattern 'exception' (score 100.0%) in '.ERROR Fatal exception.'
2014-02-09 22:43:37,353: - <WatchStdout 'watch:stdout'> score: 100.0%
2014-02-09 22:43:37,356: End of session: score=100.0%, duration=0.485 second
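
The traceback above ends in an uncaught `zlib.error` raised while decompressing a mangled gzip body. The following is a minimal sketch of one way to turn that into a recoverable error, not Wpull's actual code: `DecodeError` and `decompress_gzip` are hypothetical names introduced for illustration only.

```python
import gzip
import zlib


class DecodeError(Exception):
    """Hypothetical error type for a response body that cannot be decoded."""


def decompress_gzip(data):
    """Decompress gzip bytes, converting zlib failures into DecodeError."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        return decompressor.decompress(data) + decompressor.flush()
    except zlib.error as error:
        # Corrupted input (as produced by the fuzzer) ends up here
        # instead of propagating as a fatal exception.
        raise DecodeError('Corrupted gzip data: {}'.format(error)) from error


sample = gzip.compress(b'hello')
restored = decompress_gzip(sample)
```

A caller could catch `DecodeError` and treat the fetch as a failed try rather than aborting the whole run.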

Implement --no-iri, --remote-encoding

  • --remote-encoding should force documents to be decoded with the specified encoding.
  • --no-iri's behavior is not clear yet.
  • --local-encoding is unlikely to be implemented.
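
As a rough illustration of what forcing an encoding could mean, here is a small sketch. The function name `decode_document` and its parameters are assumptions made for this example and do not come from Wpull.

```python
def decode_document(data, remote_encoding=None, detected_encoding='utf-8'):
    """Decode document bytes to text.

    remote_encoding: hypothetical value of --remote-encoding; when set,
        it overrides whatever encoding was detected or declared.
    detected_encoding: fallback used when no override is given.
    """
    encoding = remote_encoding or detected_encoding
    # errors='replace' keeps going on undecodable bytes instead of raising.
    return data.decode(encoding, errors='replace')


latin1_bytes = b'caf\xe9'
forced = decode_document(latin1_bytes, remote_encoding='latin-1')
fallback = decode_document(latin1_bytes)
```

With the override, the byte `0xE9` decodes to "é"; without it, the UTF-8 fallback substitutes a replacement character.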

Support cookies

The program should be able to accept and send HTTP cookies.
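
A minimal sketch of accepting and sending cookies, using the standard library's `http.cookies.SimpleCookie` for parsing. `TinyCookieStore` is a hypothetical class for illustration; it deliberately ignores domain, path, and expiry rules that a real implementation would need.

```python
from http.cookies import SimpleCookie


class TinyCookieStore:
    """Hypothetical store that remembers cookie names and values only."""

    def __init__(self):
        self._cookies = {}

    def accept(self, set_cookie_value):
        """Record cookies from the value of a Set-Cookie response header."""
        parsed = SimpleCookie()
        parsed.load(set_cookie_value)
        for name, morsel in parsed.items():
            self._cookies[name] = morsel.value

    def cookie_header(self):
        """Build the value of a Cookie request header."""
        pairs = sorted(self._cookies.items())
        return '; '.join('{}={}'.format(name, value) for name, value in pairs)


store = TinyCookieStore()
store.accept('session=abc123; Path=/')
store.accept('theme=dark')
header_value = store.cookie_header()
```

For full matching rules, the standard library's `http.cookiejar` module is the more complete starting point.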

Fusil Fuzz: _read_response_header: ValueError: need more than 1 value to unpack

stdout:

INFO Fetching \u2018http://127.0.0.1:8898/robots.txt\u2019.
Requesting http://127.0.0.1:8898/robots.txt... Length: 0 [text/html]

Bytes received: 0
INFO Fetched \u2018http://127.0.0.1:8898/robots.txt\u2019: 404 Not Found. Length: 0 [text/html].
INFO Fetching \u2018http://127.0.0.1:8898/\u2019.
Requesting http://127.0.0.1:8898/... ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 123, in _process_input
    yield self._process_url_item(url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 147, in _process_url_item
    session, url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 169, in _process_session
    response_factory=session.response_factory())
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 507, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 636, in fetch
    raise response from response
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 522, in _process_request
    response = yield connection.fetch(request, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 276, in fetch
    response_factory)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 313, in _process_request
    response = yield self._read_response_header(response_factory)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 507, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 350, in _read_response_header
    response.fields.parse(header)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/namevalue.py", line 38, in parse
    name, value = line.split(':', 1)
ValueError: need more than 1 value to unpack
INFO FINISHED.
INFO Time length: 0.1 seconds.
INFO Downloaded: 0 files, 0 bytes.
INFO Exiting with status 2.
404 Not Found

session.log:

2014-01-25 02:03:27,541: [13][session 24][project] Start session
2014-01-25 02:03:27,542: [13][session 24][step 2][process:python3:env] Create environment variable PYTHONPATH: (len=106)
2014-01-25 02:03:27,543: [13][session 24][step 2][process:python3:env] Environment: {'PYTHONPATH': '/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/../..'}
2014-01-25 02:03:27,543: [13][session 24][step 2][process:python3] Stdin: /dev/null
2014-01-25 02:03:27,543: [13][session 24][step 2][process:python3] Stdout filename: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-24/stdout
2014-01-25 02:03:27,543: [13][session 24][step 2][process:python3] Create process: ['/usr/bin/python3', '-m', 'wpull', '127.0.0.1:8898', '--timeout', '2.0', '--tries', '1']
2014-01-25 02:03:27,543: [13][session 24][step 2][process:python3] Working directory: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-24
2014-01-25 02:03:27,546: [13][session 24][step 2][process:python3] Process identifier: 22452
2014-01-25 02:03:27,795: [13][session 24][step 190][watch:stdout] Not matching line: 'INFO Fetching \\u2018http://127.0.0.1:8898/robots.txt\\u2019.'
2014-01-25 02:03:27,838: [13][session 24][step 222][tcp_server:(localhost):8898] Accept client
2014-01-25 02:03:27,838: [13][session 24][step 222][tcp_server:(localhost):8898] New client: <ServerClient (host 127.0.0.1, port 53759)>
2014-01-25 02:03:27,840: [13][session 24][step 223][tcp_server:(localhost):8898] Read data from <ServerClient (host 127.0.0.1, port 53759)>
2014-01-25 02:03:27,851: [13][session 24][step 223][tcp_server:(localhost):8898] Error 404: 'robots.txt'
2014-01-25 02:03:27,851: [13][session 24][step 223][tcp_server:(localhost):8898] mangle choice: 2
2014-01-25 02:03:27,851: [13][session 24][step 223][net_client:127.0.0.1:53759] Close socket
2014-01-25 02:03:27,851: [13][session 24][step 223][tcp_server:(localhost):8898] Client closed: <ServerClient (host 127.0.0.1, port 53759)>
2014-01-25 02:03:27,857: [13][session 24][step 225][watch:stdout] Not matching line: 'Requesting http://127.0.0.1:8898/robots.txt... Length: 0 [text/html]'
2014-01-25 02:03:27,857: [13][session 24][step 225][watch:stdout] Not matching line: 'Bytes received: 0'
2014-01-25 02:03:27,857: [13][session 24][step 225][watch:stdout] Not matching line: 'INFO Fetched \\u2018http://127.0.0.1:8898/robots.txt\\u2019: 404 Not Found. Length: 0 [text/html].'
2014-01-25 02:03:27,858: [13][session 24][step 225][watch:stdout] Not matching line: 'INFO Fetching \\u2018http://127.0.0.1:8898/\\u2019.'
2014-01-25 02:03:27,859: [13][session 24][step 226][tcp_server:(localhost):8898] Accept client
2014-01-25 02:03:27,860: [13][session 24][step 226][tcp_server:(localhost):8898] New client: <ServerClient (host 127.0.0.1, port 53760)>
2014-01-25 02:03:27,863: [13][session 24][step 227][tcp_server:(localhost):8898] Read data from <ServerClient (host 127.0.0.1, port 53760)>
2014-01-25 02:03:27,873: [13][session 24][step 227][tcp_server:(localhost):8898] mangle choice: 1
2014-01-25 02:03:27,874: [13][session 24][step 227][tcp_server:(localhost):8898] Mangled data: bytearray(b'HTTP/1.0 200 OK\r\nServer:\x1bFusil\r\nPragma: no-cache\r\nContent-Type: text/html\r\nContent-Length* 45\r\n\r\n')
2014-01-25 02:03:27,874: [13][session 24][step 227][net_client:127.0.0.1:53760] Close socket
2014-01-25 02:03:27,874: [13][session 24][step 227][tcp_server:(localhost):8898] Client closed: <ServerClient (host 127.0.0.1, port 53760)>
2014-01-25 02:03:27,879: [13][session 24][step 230][watch:stdout] Match pattern 'exception' (score 100.0%) in 'Requesting http://127.0.0.1:8898/... ERROR Fatal exception.'
2014-01-25 02:03:27,881: [13][session 24][step 231][session 24] - <WatchStdout 'watch:stdout'> score: 100.0%
2014-01-25 02:03:27,884: [13][session 24][step 232][project] End of session: score=100.0%, duration=0.343 second
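
The failure above comes from unpacking `line.split(':', 1)` when a header line contains no colon (here the fuzzer turned `Content-Length:` into `Content-Length*`). A minimal sketch of a tolerant parser follows; `parse_header_lines` is a hypothetical helper, not the function in `wpull/namevalue.py`, and skipping malformed lines is only one possible policy.

```python
def parse_header_lines(text):
    """Split header text into (name, value) pairs and malformed lines."""
    fields = []
    malformed = []
    for line in text.split('\r\n'):
        if not line:
            continue
        # partition() never raises: sep is '' when there is no colon.
        name, sep, value = line.partition(':')
        if not sep:
            malformed.append(line)
            continue
        fields.append((name.strip(), value.strip()))
    return fields, malformed


raw = 'Pragma: no-cache\r\nContent-Type: text/html\r\nContent-Length* 45\r\n\r\n'
fields, malformed = parse_header_lines(raw)
```

The caller can then decide whether a non-empty `malformed` list should be logged, ignored, or reported as a protocol error.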

Support using a proxy and --no-proxy

Wpull should be able to pick up proxy settings from the environment and use them.

Note: Wpull should either raise an error or give a warning when a proxy is used while --warc-file is enabled.
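
A sketch of reading proxy settings from environment variables, under the assumption that the conventional `http_proxy`, `https_proxy`, and `no_proxy` names are used. The function names are hypothetical; Python's own `urllib.request.getproxies()` performs a similar lookup against the real environment.

```python
def proxies_from_environ(environ):
    """Map scheme -> proxy URL from variables named like <scheme>_proxy."""
    proxies = {}
    for name, value in environ.items():
        lower = name.lower()
        if value and lower.endswith('_proxy') and lower != 'no_proxy':
            proxies[lower[:-len('_proxy')]] = value
    return proxies


def bypasses_proxy(host, environ):
    """Return True if host matches an entry in no_proxy / NO_PROXY."""
    no_proxy = environ.get('no_proxy') or environ.get('NO_PROXY') or ''
    host = host.lower()
    for entry in no_proxy.split(','):
        entry = entry.strip().lstrip('.').lower()
        if not entry:
            continue
        if entry == '*' or host == entry or host.endswith('.' + entry):
            return True
    return False


# A plain dict stands in for os.environ so the example is deterministic.
env = {
    'http_proxy': 'http://proxy.example:3128',
    'NO_PROXY': 'localhost,.internal.example',
}
proxies = proxies_from_environ(env)
```

Passing the mapping in as a parameter keeps the lookup testable; a real caller would pass `os.environ`.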

Support --warc-dedup

The idea is that Wget will use a CDX file as a database to check whether a URL has already been downloaded. If it has, a revisit WARC record is written instead.
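
The lookup could be sketched roughly as follows. This assumes a space-delimited CDX file whose first line is a ` CDX` legend of single-letter field codes, with `a` standing for the original URL; the sample data, the field selection, and the function names are all made up for illustration and are not taken from Wget or Wpull.

```python
def load_cdx_index(lines):
    """Build a dict of URL -> field dict from CDX lines."""
    iterator = iter(lines)
    header = next(iterator).split()
    if not header or header[0] != 'CDX':
        raise ValueError('Missing CDX legend line')
    letters = header[1:]
    url_position = letters.index('a')  # 'a' is assumed to be the URL field
    index = {}
    for line in iterator:
        fields = line.split()
        if len(fields) != len(letters):
            continue  # skip lines that do not match the legend
        index[fields[url_position]] = dict(zip(letters, fields))
    return index


def record_type_for(url, index):
    """Choose 'revisit' for URLs already in the index, else 'response'."""
    return 'revisit' if url in index else 'response'


sample_cdx = [
    ' CDX a k u',
    'http://example.com/ ABC123 <urn:uuid:1>',
]
cdx_index = load_cdx_index(sample_cdx)
```

A fuller version would probably also compare payload digests before deciding on a revisit record, since the same URL can return different content.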

WARC Content-Type may need to have a space but it shouldn't

In regard to internetarchive/CDX-Writer#4,

application/http;msgtype=request and application/http;msgtype=response may need to have a space after the semicolon (application/http; msgtype=response) for de facto compatibility.

Wpull currently uses no space, in the same fashion as Wget.

The WARC ISO 28500 draft spec Annex C examples also omit the space.

Cursory searches reveal writers seem to omit the space:

Unfortunately, it seems like there's fixed string comparison syndrome for readers:
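
For comparison, a reader that parses the value instead of comparing fixed strings treats both spellings as equal. This is only a sketch; `parse_media_type` is a hypothetical helper and handles just the simple `type/subtype; key=value` form discussed here.

```python
def parse_media_type(value):
    """Return (media_type, params) with whitespace and case normalized."""
    parts = [part.strip() for part in value.split(';')]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if '=' in part:
            key, param_value = part.split('=', 1)
            params[key.strip().lower()] = param_value.strip().strip('"')
    return media_type, params


without_space = parse_media_type('application/http;msgtype=response')
with_space = parse_media_type('application/http; msgtype=response')
```

Both calls yield the same tuple, so a reader written this way would not care which form a writer emits.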

Fusil Fuzz: ProgressRecorderSession.pre_response is printing to the wrong stream

stdout:

INFO Fetching \u2018http://127.0.0.1:8898/robots.txt\u2019.
Requesting http://127.0.0.1:8898/robots.txt... ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 123, in _process_input
    yield self._process_url_item(url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 147, in _process_url_item
    session, url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 169, in _process_session
    response_factory=session.response_factory())
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 507, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 638, in fetch
    raise response from response
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 524, in _process_request
    response = yield connection.fetch(request, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 278, in fetch
    response_factory)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 315, in _process_request
    response = yield self._read_response_header(response_factory)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 507, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 353, in _read_response_header
    self._events.pre_response.fire(response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/actor.py", line 24, in fire
    handler(*args, **kargs)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/recorder.py", line 95, in pre_response
    session.pre_response(response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/recorder.py", line 412, in pre_response
    print(response.status_code, response.status_reason)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfe' in position 2: ordinal not in range(128)
INFO FINISHED.
INFO Time length: 0.1 seconds.
INFO Downloaded: 0 files, 0 bytes.
INFO Exiting with status 2.
200 

session.log:

2014-01-25 20:25:42,121: Start session
2014-01-25 20:25:42,122: Create environment variable PYTHONPATH: (len=106)
2014-01-25 20:25:42,122: Environment: {'PYTHONPATH': '/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/../..'}
2014-01-25 20:25:42,123: Stdin: /dev/null
2014-01-25 20:25:42,123: Stdout filename: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-4/stdout
2014-01-25 20:25:42,123: Create process: ['/usr/bin/python3', '-m', 'wpull', '127.0.0.1:8898', '--timeout', '2.0', '--tries', '1']
2014-01-25 20:25:42,123: Working directory: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-4
2014-01-25 20:25:42,131: Process identifier: 15504
2014-01-25 20:25:42,416: Accept client
2014-01-25 20:25:42,416: New client: <ServerClient (host 127.0.0.1, port 37954)>
2014-01-25 20:25:42,419: Read data from <ServerClient (host 127.0.0.1, port 37954)>
2014-01-25 20:25:42,431: request choice: 1
2014-01-25 20:25:42,431: Mangle content: YES
2014-01-25 20:25:42,432: Mangled data: bytearray(b"\'* the uncoding o\xff\x7fthis\x86d/cuoent s\x80\x00\x00\x00d be\x95in\xe2Shift_JIS\xff\xff\xff\xfe/\x80\xb7\x8e\x84\x82\xcd\x83K\x83\xa9\x83X\x82\xf0\x90\xf9\x82\xd7\x82\xe7\x82\xea\x00\xfef\xb7\x81B\xd4\xbb\x9e\xea\xbf\xcd\x8e\x84\xff\x80\x8f\x9d\x82\xfe\xff\xff\xff\xff\x00\xb9\x00\x00\x00\x80 */\nbod\x08\xf6;\x7f    backgrfu\xfe\xff\xff\xffma\x80e\xff\xfeurl(g/\x98\x00\x8e\x9a\x89\xbb\x82\xaf.Png\'\xff\xfe\n}\n\n")
2014-01-25 20:25:42,432: Mangle header: YES
2014-01-25 20:25:42,432: Mangled data: bytearray(b'HTTP/1.0 200 O\x00\xfe\nServer: Fusil\xff\xff\xff\xfeagma: no-cache\r\nContent-Type: text/html\r\nContent-Length: 170\r\n\r\n')
2014-01-25 20:25:42,432: Close socket
2014-01-25 20:25:42,432: Client closed: <ServerClient (host 127.0.0.1, port 37954)>
2014-01-25 20:25:42,440: Match pattern 'exception' (score 100.0%) in 'Requesting http://127.0.0.1:8898/robots.txt... ERROR Fatal exception.'
2014-01-25 20:25:42,442: - <WatchStdout 'watch:stdout'> score: 100.0%
2014-01-25 20:25:42,446: End of session: score=100.0%, duration=0.326 second

TL;DR: ProgressRecorderSession.pre_response should be printing to self._stream not stdout!
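A minimal sketch of the suggested fix, assuming the session keeps its output stream in `self._stream`. `ProgressPrinter` is a hypothetical stand-in, not the real `ProgressRecorderSession`, and escaping unencodable characters is an extra assumption on top of the fix named above.

```python
import io
import sys


class ProgressPrinter:
    """Hypothetical stand-in for a progress recorder session."""

    def __init__(self, stream=None):
        # Default to stderr so progress text never mixes with stdout.
        self._stream = stream if stream is not None else sys.stderr

    def pre_response(self, status_code, status_reason):
        text = '{} {}'.format(status_code, status_reason)
        # Escape characters the stream cannot encode instead of raising
        # UnicodeEncodeError on a mangled status line.
        encoding = getattr(self._stream, 'encoding', None) or 'ascii'
        safe_text = text.encode(encoding, 'backslashreplace').decode(encoding)
        print(safe_text, file=self._stream)


stream = io.StringIO()
ProgressPrinter(stream).pre_response(200, 'O\x00\xfe')
```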

Unusual queue inactivity

2014-02-04 10:24:28,781 - wpull.processor - DEBUG - URL Filter test <wpull.url.D
irectoryFilter object at 0x3a35450> returned True
2014-02-04 10:24:28,782 - wpull.engine - DEBUG - Session iteration for URLRecord
(url='https://lh3.googleusercontent.com/-CV5Q_-tVlx4/AAAAAAAAAAI/AAAAAAAAAQ4/5mX
xSAahlp4/s48-c/photo.jpg', status='in_progress', try_count=0, level=4, top_url='
http://www.schemer.com/sitemap', status_code=None, referrer='https://www.schemer
.com/profile/eavo5irglhag2', inline=True, link_type=None, url_encoding='utf-8', 
post_data=None) URLInfo(scheme='https', netloc='lh3.googleusercontent.com', path
='/-CV5Q_-tVlx4/AAAAAAAAAAI/AAAAAAAAAQ4/5mXxSAahlp4/s48-c/photo.jpg', query=None
, fragment='', username=None, password=None, hostname='lh3.googleusercontent.com
', port=443, raw='https://lh3.googleusercontent.com/-CV5Q_-tVlx4/AAAAAAAAAAI/AAA
AAAAAAQ4/5mXxSAahlp4/s48-c/photo.jpg', url='https://lh3.googleusercontent.com/-C
V5Q_-tVlx4/AAAAAAAAAAI/AAAAAAAAAQ4/5mXxSAahlp4/s48-c/photo.jpg', encoding='utf-8')
2014-02-04 10:24:28,783 - wpull.engine - INFO - Fetching ‘https://lh3.googleusercontent.com/-CV5Q_-tVlx4/AAAAAAAAAAI/AAAAAAAAAQ4/5mXxSAahlp4/s48-c/photo.jpg’.
2014-02-04 10:24:28,786 - wpull.http - DEBUG - Client fetch request <Request(GET, /-CV5Q_-tVlx4/AAAAAAAAAAI/AAAAAAAAAQ4/5mXxSAahlp4/s48-c/photo.jpg, HTTP/1.1)>.
2014-02-04 10:24:28,787 - wpull.http - DEBUG - Connection pool queue request <Request(GET, /-CV5Q_-tVlx4/AAAAAAAAAAI/AAAAAAAAAQ4/5mXxSAahlp4/s48-c/photo.jpg, HTTP/1.1)>
2014-02-04 10:24:58,214 - wpull.http - DEBUG - Stream closed. active=False connected=True closed=True reading=False writing=False
2014-02-04 10:27:16,840 - wpull.http - DEBUG - Stream closed. active=False connected=True closed=True reading=False writing=False
2014-02-04 10:28:06,872 - wpull.http - DEBUG - Stream closed. active=False connected=True closed=True reading=False writing=False
2014-02-04 10:28:14,959 - wpull.http - DEBUG - Stream closed. active=False connected=True closed=True reading=False writing=False
2014-02-04 10:28:24,973 - wpull.http - DEBUG - Stream closed. active=False connected=True closed=True reading=False writing=False
2014-02-04 10:28:25,802 - wpull.http - DEBUG - Stream closed. active=False connected=True closed=True reading=False writing=False
2014-02-04 10:28:28,198 - wpull.http - DEBUG - Stream closed. active=False connected=True closed=True reading=False writing=False
2014-02-04 12:06:56,537 - __main__ - INFO - Stopping once all requests complete...
2014-02-04 12:06:56,538 - __main__ - INFO - Interrupt again to force stopping immediately.

The program had to be stopped and restarted manually. I hope this isn't a bug in Toro.

Basic password authentication support

Support fetching URLs that have a password in them.

Edit: The options that need to be implemented are:

  • --user
  • --password
  • --http-user
  • --http-password
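As a rough sketch of what these options would feed into, the snippet below builds an HTTP Basic `Authorization` header value from explicit credentials or from the userinfo part of the URL. The function name `basic_auth_header` and the rule that explicit options override the URL are assumptions for illustration, not Wpull's actual behavior.

```python
import base64
from urllib.parse import unquote, urlsplit


def basic_auth_header(url, user=None, password=None):
    """Return a Basic Authorization header value, or None if no username."""
    parts = urlsplit(url)
    if user is None and parts.username is not None:
        # Credentials embedded in the URL are percent-encoded.
        user = unquote(parts.username)
        if password is None and parts.password is not None:
            password = unquote(parts.password)
    if user is None:
        return None
    credentials = '{}:{}'.format(user, password or '')
    token = base64.b64encode(credentials.encode('utf-8')).decode('ascii')
    return 'Basic ' + token
```

For example, `basic_auth_header('http://alice:secret@example.com/')` encodes `alice:secret`, and a URL without userinfo yields `None` unless `user` is given.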

--input-file doesn't work

--input-file isn't implemented correctly. It just uses the option as a string and reads it character by character.
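The symptom matches iterating over the option value itself: in Python, looping over a string yields single characters, so `list('urls.txt')` gives `['u', 'r', 'l', ...]`. A minimal sketch of the intended behavior follows; the helper name `read_input_urls` is hypothetical, and one URL per line is an assumption about the file format.

```python
def read_input_urls(filename):
    """Yield one URL per non-empty line of the given file."""
    with open(filename, 'r', encoding='utf-8') as in_file:
        for line in in_file:
            url = line.strip()
            if url:
                yield url
```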

Support uploading files

The options needed are:

  • --method
  • --body-data
  • --body-file

Edit: --body-data is for HTML forms like --post-data

Maybe upload file options should use something like --upload-data/--upload-file.

Fusil Fuzz: int(content_length): ValueError: invalid literal for int() with base 10: '350-'

stdout:

INFO Fetching \u2018http://127.0.0.1:8898/robots.txt\u2019.
Requesting http://127.0.0.1:8898/robots.txt... 200 OK
ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 123, in _process_input
    yield self._process_url_item(url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 147, in _process_url_item
    session, url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 169, in _process_session
    response_factory=session.response_factory())
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 507, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 638, in fetch
    raise response from response
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 524, in _process_request
    response = yield connection.fetch(request, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 278, in fetch
    response_factory)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 315, in _process_request
    response = yield self._read_response_header(response_factory)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 507, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 353, in _read_response_header
    self._events.pre_response.fire(response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/actor.py", line 24, in fire
    handler(*args, **kargs)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/recorder.py", line 95, in pre_response
    session.pre_response(response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/recorder.py", line 419, in pre_response
    self._content_length = int(content_length)
ValueError: invalid literal for int() with base 10: '350-'
INFO FINISHED.
INFO Time length: 0.1 seconds.
INFO Downloaded: 0 files, 0 bytes.
INFO Exiting with status 2.

session.log:

2014-01-26 00:47:01,007: Start session
2014-01-26 00:47:01,009: Create environment variable PYTHONPATH: (len=106)
2014-01-26 00:47:01,009: Environment: {'PYTHONPATH': '/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/../..'}
2014-01-26 00:47:01,009: Stdin: /dev/null
2014-01-26 00:47:01,009: Stdout filename: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-40/stdout
2014-01-26 00:47:01,009: Create process: ['/usr/bin/python3', '-m', 'wpull', '127.0.0.1:8898', '--timeout', '2.0', '--tries', '1']
2014-01-26 00:47:01,009: Working directory: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-40
2014-01-26 00:47:01,012: Process identifier: 29446
2014-01-26 00:47:01,312: Accept client
2014-01-26 00:47:01,312: New client: <ServerClient (host 127.0.0.1, port 43614)>
2014-01-26 00:47:01,314: Read data from <ServerClient (host 127.0.0.1, port 43614)>
2014-01-26 00:47:01,324: request choice: 1
2014-01-26 00:47:01,324: Mangle header: YES
2014-01-26 00:47:01,325: Mangled data: bytearray(b'HTTP/1.0 200 OK\r\x08Server: Fusil\r\nPragma: \xe0o-cache\r\nCon\x06ent-Type:\x86text/htUl\r\nContent-Length: 350-\n\r\n')
2014-01-26 00:47:01,325: Close socket
2014-01-26 00:47:01,325: Client closed: <ServerClient (host 127.0.0.1, port 43614)>
2014-01-26 00:47:01,331: Match pattern 'exception' (score 100.0%) in 'ERROR Fatal exception.'
2014-01-26 00:47:01,333: - <WatchStdout 'watch:stdout'> score: 100.0%
2014-01-26 00:47:01,336: End of session: score=100.0%, duration=0.329 second
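The traceback ends at `int(content_length)` with the mangled value `'350-'`. One way to tolerate such headers is to treat anything that is not a plain decimal number as an unknown length. This is a sketch under that assumption; `parse_content_length` is a hypothetical helper, not existing Wpull code.

```python
import re


def parse_content_length(value):
    """Return the length as an int, or None if the header is missing or malformed."""
    if value is None:
        return None
    text = value.strip()
    # Accept ASCII digits only; '350-', '-1', and '' are all rejected.
    if re.fullmatch(r'[0-9]+', text):
        return int(text)
    return None
```

A caller would then show progress without a total when the result is `None`, instead of crashing.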

FTP support

Is supporting FTP needed? If so, things like http.Request need to be abstracted out into something like protocol.BaseRequest.

Link conversion may not convert the actual files downloaded

Currently, Wpull derives a filename from the URL and assumes the file has been stored under that filename. However, some writers, such as the clobbering one, write filenames suffixed with .N, where N is a number. Wpull ignores the file with the suffixed filename and incorrectly attempts to convert the file with the unsuffixed name.

Wget overcomes this problem by keeping track of the path of the file that was downloaded. An optimization would be to store this filename on disk only if it diverges from the assumed filename.
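The optimization could look roughly like the sketch below: remember the actual path only when it differs from the path assumed from the URL. `DownloadedPathTable` is a hypothetical in-memory illustration; a real implementation would presumably persist the mapping in the on-disk database.

```python
class DownloadedPathTable:
    """Track actual download paths, storing only those that diverge."""

    def __init__(self):
        self._overrides = {}

    def record(self, url, assumed_path, actual_path):
        # Store an entry only when the writer chose a different filename,
        # for example 'index.html.1' instead of 'index.html'.
        if actual_path != assumed_path:
            self._overrides[url] = actual_path

    def lookup(self, url, assumed_path):
        """Return the path that link conversion should operate on."""
        return self._overrides.get(url, assumed_path)

    def __len__(self):
        return len(self._overrides)
```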

Fusil Fuzz: _scrape_tree: AttributeError: 'NoneType' object has no attribute 'iter'

stdout:

INFO Fetching \u2018http://127.0.0.1:8898/robots.txt\u2019.
Requesting http://127.0.0.1:8898/robots.txt... Length: 0 [text/html]

Bytes received: 0
INFO Fetched \u2018http://127.0.0.1:8898/robots.txt\u2019: 404 Not Found. Length: 0 [text/html].
INFO Fetching \u2018http://127.0.0.1:8898/\u2019.
Requesting http://127.0.0.1:8898/... Length: 45 [text/html]
.
Bytes received: 45
INFO Fetched \u2018http://127.0.0.1:8898/\u2019: 200 OK. Length: 45 [text/html].
ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 123, in _process_input
    yield self._process_url_item(url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 147, in _process_url_item
    session, url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 507, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 190, in _process_session
    is_done = session.handle_response(response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/robotstxt.py", line 111, in handle_response
    return super().handle_response(response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 206, in handle_response
    return self._handle_document(response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 218, in _handle_document
    self._scrape_document(self._request, response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 319, in _scrape_document
    scraper, request, response
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/processor.py", line 329, in _process_scraper
    scrape_info = scraper.scrape(request, response)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/document.py", line 103, in scrape
    for scraped_link in self._scrape_tree(root):
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/document.py", line 136, in _scrape_tree
    for element in root.iter():
AttributeError: 'NoneType' object has no attribute 'iter'
INFO FINISHED.
INFO Time length: 0.1 seconds.
INFO Downloaded: 0 files, 0 bytes.
INFO Exiting with status 1.
404 Not Found
200 OK

session.log:

2014-01-25 02:03:30,933: [15][session 31][project] Start session
2014-01-25 02:03:30,935: [15][session 31][step 2][process:python3:env] Create environment variable PYTHONPATH: (len=106)
2014-01-25 02:03:30,935: [15][session 31][step 2][process:python3:env] Environment: {'PYTHONPATH': '/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/../..'}
2014-01-25 02:03:30,935: [15][session 31][step 2][process:python3] Stdin: /dev/null
2014-01-25 02:03:30,935: [15][session 31][step 2][process:python3] Stdout filename: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-31/stdout
2014-01-25 02:03:30,935: [15][session 31][step 2][process:python3] Create process: ['/usr/bin/python3', '-m', 'wpull', '127.0.0.1:8898', '--timeout', '2.0', '--tries', '1']
2014-01-25 02:03:30,936: [15][session 31][step 2][process:python3] Working directory: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-31
2014-01-25 02:03:30,939: [15][session 31][step 2][process:python3] Process identifier: 22522
2014-01-25 02:03:31,174: [15][session 31][step 181][watch:stdout] Not matching line: 'INFO Fetching \\u2018http://127.0.0.1:8898/robots.txt\\u2019.'
2014-01-25 02:03:31,220: [15][session 31][step 214][tcp_server:(localhost):8898] Accept client
2014-01-25 02:03:31,220: [15][session 31][step 214][tcp_server:(localhost):8898] New client: <ServerClient (host 127.0.0.1, port 53772)>
2014-01-25 02:03:31,222: [15][session 31][step 215][tcp_server:(localhost):8898] Read data from <ServerClient (host 127.0.0.1, port 53772)>
2014-01-25 02:03:31,232: [15][session 31][step 215][tcp_server:(localhost):8898] Error 404: 'robots.txt'
2014-01-25 02:03:31,233: [15][session 31][step 215][tcp_server:(localhost):8898] mangle choice: 0
2014-01-25 02:03:31,233: [15][session 31][step 215][net_client:127.0.0.1:53772] Close socket
2014-01-25 02:03:31,233: [15][session 31][step 215][tcp_server:(localhost):8898] Client closed: <ServerClient (host 127.0.0.1, port 53772)>
2014-01-25 02:03:31,239: [15][session 31][step 216][watch:stdout] Not matching line: 'Requesting http://127.0.0.1:8898/robots.txt... Length: 0 [text/html]'
2014-01-25 02:03:31,239: [15][session 31][step 216][watch:stdout] Not matching line: 'Bytes received: 0'
2014-01-25 02:03:31,239: [15][session 31][step 216][watch:stdout] Not matching line: 'INFO Fetched \\u2018http://127.0.0.1:8898/robots.txt\\u2019: 404 Not Found. Length: 0 [text/html].'
2014-01-25 02:03:31,240: [15][session 31][step 216][watch:stdout] Not matching line: 'INFO Fetching \\u2018http://127.0.0.1:8898/\\u2019.'
2014-01-25 02:03:31,241: [15][session 31][step 217][tcp_server:(localhost):8898] Accept client
2014-01-25 02:03:31,241: [15][session 31][step 217][tcp_server:(localhost):8898] New client: <ServerClient (host 127.0.0.1, port 53773)>
2014-01-25 02:03:31,243: [15][session 31][step 218][tcp_server:(localhost):8898] Read data from <ServerClient (host 127.0.0.1, port 53773)>
2014-01-25 02:03:31,253: [15][session 31][step 218][tcp_server:(localhost):8898] mangle choice: 2
2014-01-25 02:03:31,254: [15][session 31][step 218][tcp_server:(localhost):8898] Mangled data: bytearray(b'\x01\x00\x01\x00l~Z\xff\x0f`y\x80\x00p<\x7f\xffndo\xff\xff-\x83{d\xec</\xfe\x80\x00\xb4Bo\x7f\xff\xff\xffV\xc1\xff\x7f\xff7')
2014-01-25 02:03:31,254: [15][session 31][step 218][net_client:127.0.0.1:53773] Close socket
2014-01-25 02:03:31,254: [15][session 31][step 218][tcp_server:(localhost):8898] Client closed: <ServerClient (host 127.0.0.1, port 53773)>
2014-01-25 02:03:31,256: [15][session 31][step 219][watch:stdout] Not matching line: 'Requesting http://127.0.0.1:8898/... Length: 45 [text/html]'
2014-01-25 02:03:31,258: [15][session 31][step 220][watch:stdout] Not matching line: '.'
2014-01-25 02:03:31,258: [15][session 31][step 220][watch:stdout] Not matching line: 'Bytes received: 45'
2014-01-25 02:03:31,259: [15][session 31][step 220][watch:stdout] Not matching line: 'INFO Fetched \\u2018http://127.0.0.1:8898/\\u2019: 200 OK. Length: 45 [text/html].'
2014-01-25 02:03:31,260: [15][session 31][step 221][watch:stdout] Match pattern 'exception' (score 100.0%) in 'ERROR Fatal exception.'
2014-01-25 02:03:31,262: [15][session 31][step 222][session 31] - <WatchStdout 'watch:stdout'> score: 100.0%
2014-01-25 02:03:31,265: [15][session 31][step 223][project] End of session: score=100.0%, duration=0.332 second
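The traceback shows `root` being `None` when `_scrape_tree` calls `root.iter()`, which suggests the parser produced no root element for the mangled document. A guard along these lines would avoid the crash. The function name `iter_elements` is hypothetical, and the standard library's ElementTree is used here only to keep the sketch self-contained.

```python
import xml.etree.ElementTree as ET


def iter_elements(root):
    """Yield every element under root, or nothing if parsing gave no root."""
    if root is None:
        return
    for element in root.iter():
        yield element


# A parsed document yields its elements; a missing root yields nothing.
tags = [element.tag for element in iter_elements(ET.fromstring('<a><b/></a>'))]
```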

http.Connection _stream_closed_callback errors sometimes don't get caught

Connection errors, such as connection resets, raised by http.Connection's _stream_closed_callback don't seem to be caught by http.HostConnectionPool.

The log shows the error thrown, but no retry. The process was CTRL+C'ed.

2014-01-26 20:49:02,840 - wpull.http - DEBUG - Host pool got request <Request(GET, /page/list/cat%2C23707%2Cid_member%2C6857%2Clanguage%2CE.html, HTTP/1.1)>
2014-01-26 20:49:02,840 - wpull.http - DEBUG - Getting a connection.
2014-01-26 20:49:02,841 - wpull.http - DEBUG - Request <Request(GET, /page/list/cat%2C23707%2Cid_member%2C6857%2Clanguage%2CE.html, HTTP/1.1)>.
2014-01-26 20:49:02,844 - wpull.http - DEBUG - Sending headers.
2014-01-26 20:49:02,846 - wpull.http - DEBUG - Sending body.
2014-01-26 20:49:02,850 - wpull.http - DEBUG - Reading header.
2014-01-26 20:57:03,350 - wpull.http - DEBUG - Stream closed. active=True connected=True closed=True reading=False writing=False
2014-01-26 20:57:03,351 - wpull.http - DEBUG - Throwing error [Errno 104] Connection reset by peer.
2014-01-26 20:59:28,607 - __main__ - INFO - Stopping once all requests complete...
2014-01-26 20:59:28,608 - __main__ - INFO - Interrupt again to force stopping immediately.
2014-01-26 20:59:28,608 - wpull.engine - DEBUG - Stopping. force=False
2014-01-26 20:59:33,070 - __main__ - INFO - Forcing immediate stop...
2014-01-26 20:59:33,071 - wpull.engine - DEBUG - Stopping. force=True
2014-01-26 20:59:33,072 - wpull.engine - INFO - FINISHED.

Implement --quota

The program should stop gracefully when it reaches its quota.

Edit: Removed "and --warc-max-size"
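
One way to read "stop gracefully" is that the quota is checked only between downloads, so a file in progress is always finished and the total may overshoot the limit. The sketch below illustrates that reading only; `run_until_quota` and its arguments are hypothetical names and are not part of Wpull.

```python
def run_until_quota(sizes, quota):
    """Simulate downloading files of the given sizes under a byte quota.

    Hypothetical helper: the quota is checked before each new download,
    never in the middle of one, so the last file is always completed.
    """
    downloaded = 0
    finished = []
    for size in sizes:
        # Stop scheduling new work once the quota has been reached.
        if quota is not None and downloaded >= quota:
            break
        downloaded += size
        finished.append(size)
    return finished, downloaded


if __name__ == '__main__':
    print(run_until_quota([60, 60, 60], quota=100))
```

With this design the third file is never started, but the second one is allowed to complete even though it pushes the total past the quota.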

ProcessorSession.new_request doesn't pass IRI encoding to Request.new.

Not passing the encoding effectively disables IRI support.

Things that need to be fixed:

  1. ProcessorSession.new_request
  2. Request.new
  3. The quasi quote functions in url.py should default to Latin-1 because the percent-encoded text might just be binary blobs.
Traceback (most recent call last):
  File "/usr/local/lib/python3.2/dist-packages/wpull/engine.py", line 123, in _process_input
    yield self._process_url_item(url_item)
  File "/usr/local/lib/python3.2/dist-packages/tornado/gen.py", line 557, in run
    self.yield_point.start(self)
  File "/usr/local/lib/python3.2/dist-packages/tornado/gen.py", line 399, in start
    self.result = self.future.result()
  File "/usr/local/lib/python3.2/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.2/dist-packages/tornado/gen.py", line 227, in wrapper
    runner.run()
  File "/usr/local/lib/python3.2/dist-packages/tornado/gen.py", line 531, in run
    yielded = self.gen.send(next)
  File "/usr/local/lib/python3.2/dist-packages/wpull/engine.py", line 163, in _process_session
    request = session.new_request()
  File "/usr/local/lib/python3.2/dist-packages/wpull/processor.py", line 166, in new_request
    url_record.referrer,
  File "/usr/local/lib/python3.2/dist-packages/wpull/processor.py", line 177, in _new_request_instance
    request = self._request_factory(url)
  File "/usr/local/lib/python3.2/dist-packages/wpull/app.py", line 374, in request_factory
    request = self._classes['Request'].new(*args, **kwargs)
  File "/usr/local/lib/python3.2/dist-packages/wpull/http.py", line 46, in new
    url_info = URLInfo.parse(url)
  File "/usr/local/lib/python3.2/dist-packages/wpull/url.py", line 68, in parse
    cls.normalize_path(url_split_result.path, encoding=encoding),
  File "/usr/local/lib/python3.2/dist-packages/wpull/url.py", line 119, in normalize_path
    return quasi_quote(path, encoding=encoding) or '/'
  File "/usr/local/lib/python3.2/dist-packages/wpull/url.py", line 389, in quasi_quote
    unquote(string, encoding, errors),
  File "/usr/local/lib/python3.2/dist-packages/wpull/url.py", line 375, in unquote
    return urllib.parse.unquote(string, encoding, errors)
  File "/usr/lib/python3.2/urllib/parse.py", line 525, in unquote
    string += pct_sequence.decode(encoding, errors) + rest
  File "/usr/lib/python3.2/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 1: invalid start byte
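
The failure at the bottom of the traceback can be reproduced with the standard library alone. The sketch below is not Wpull's code. It assumes a sample path containing `%A3` and uses two hypothetical helpers to show why a Latin-1 default avoids the error: every byte value maps to a character in Latin-1, so decoding cannot fail.

```python
from urllib.parse import unquote


def fails_as_utf8(text):
    """Return True if percent-decoding text as strict UTF-8 raises."""
    try:
        unquote(text, encoding='utf-8', errors='strict')
    except UnicodeDecodeError:
        return True
    return False


def unquote_latin1(text):
    """Percent-decode text as Latin-1, which accepts any byte value."""
    return unquote(text, encoding='latin-1', errors='strict')


if __name__ == '__main__':
    sample = '/price%A3'  # assumed example; 0xA3 alone is not valid UTF-8
    print(fails_as_utf8(sample))
    print(repr(unquote_latin1(sample)))
```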

Doesn't span hosts for URL redirects.

In Wget, redirects seem to be strongly followed even when --span-hosts is not specified, e.g. www.schemer.com (302) → accounts.google.com (302) → www.schemer.com (200). Wget also ignores --exclude-domains and possibly other options when following them.

Wpull doesn't follow this semantic. It should either follow it or not, but either way it should provide an option like --no-strong-redirects or --strong-redirects, depending on which default is best.

Support --output-document

This option concatenates all the files into one stream.

Note: remember to give a warning that this option is potentially harmful.

Performance problem with getting URL table size for debug logging

2014-01-26 14:18:44,166 - wpull.engine - DEBUG - Sleeping 0.2512116035936583.
2014-01-26 14:18:47,374 - wpull.engine - DEBUG - Table size: 915866.
2014-01-26 14:18:47,379 - wpull.engine - DEBUG - Get next URL todo.
2014-01-26 14:18:47,388 - wpull.engine - DEBUG - Return record Record('http://im

We shouldn't be querying the size of the table since it appears to be costly.

Provide a wpull command

Wpull should provide a wpull command through setup.py's scripts parameter. Currently, the user is required to invoke the Python interpreter with python -m wpull.

However, a user can install both Python 2 and 3 versions at the same time and the command script from both installs will get clobbered. This is not ideal because a user may need to run Wpull with scripting that only works with one version of Python.

A possible solution is for the Python 3 version to install "wpull" and "wpull3" and the Python 2 version to install only "wpull2".
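
The naming scheme could be sketched as below. Note that this swaps setuptools' `console_scripts` entry points in for the `scripts` parameter mentioned above, and the target `wpull.__main__:main` is an assumption made for illustration, not a confirmed entry point.

```python
import sys


def console_script_names(major_version):
    """Return the command names to install for a Python major version."""
    if major_version >= 3:
        return ['wpull', 'wpull3']
    return ['wpull2']


def entry_points_for(major_version, target='wpull.__main__:main'):
    """Build an entry_points mapping; the default target is hypothetical."""
    return {
        'console_scripts': [
            '{0} = {1}'.format(name, target)
            for name in console_script_names(major_version)
        ],
    }


if __name__ == '__main__':
    # In a setup.py this dict would be passed as setup(entry_points=...).
    print(entry_points_for(sys.version_info[0]))
```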

Fusil Fuzz: parse_status_line: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 1: invalid start byte

stdout:

INFO Fetching \u2018http://127.0.0.1:8898/robots.txt\u2019.
Requesting http://127.0.0.1:8898/robots.txt... ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 123, in _process_input
    yield self._process_url_item(url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 147, in _process_url_item
    session, url_item)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/engine.py", line 169, in _process_session
    response_factory=session.response_factory())
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 507, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 636, in fetch
    raise response from response
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 522, in _process_request
    response = yield connection.fetch(request, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 276, in fetch
    response_factory)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 313, in _process_request
    response = yield self._read_response_header(response_factory)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/local/lib/python3.3/dist-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.3/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/local/lib/python3.3/dist-packages/tornado/gen.py", line 507, in run
    yielded = self.gen.send(next)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 348, in _read_response_header
    status_line)
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/http.py", line 87, in parse_status_line
    return to_str((groups[0], int(groups[1]), groups[2]))
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/util.py", line 126, in to_str
    return tuple([to_str(item, encoding) for item in instance])
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/util.py", line 126, in <listcomp>
    return tuple([to_str(item, encoding) for item in instance])
  File "/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/wpull/util.py", line 122, in to_str
    return instance.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 1: invalid start byte
INFO FINISHED.
INFO Time length: 0.1 seconds.
INFO Downloaded: 0 files, 0 bytes.
INFO Exiting with status 2.

session.log:

2014-01-25 02:03:38,521: [18][session 36][project] Start session
2014-01-25 02:03:38,523: [18][session 36][step 2][process:python3:env] Create environment variable PYTHONPATH: (len=106)
2014-01-25 02:03:38,523: [18][session 36][step 2][process:python3:env] Environment: {'PYTHONPATH': '/home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/../..'}
2014-01-25 02:03:38,523: [18][session 36][step 2][process:python3] Stdin: /dev/null
2014-01-25 02:03:38,523: [18][session 36][step 2][process:python3] Stdout filename: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-36/stdout
2014-01-25 02:03:38,523: [18][session 36][step 2][process:python3] Create process: ['/usr/bin/python3', '-m', 'wpull', '127.0.0.1:8898', '--timeout', '2.0', '--tries', '1']
2014-01-25 02:03:38,524: [18][session 36][step 2][process:python3] Working directory: /home/chris/data-chris-aspire-5741/Documents/programming/wget-remake/wget-remake.git/test/fuzz_fusil/fusil/session-36
2014-01-25 02:03:38,531: [18][session 36][step 2][process:python3] Process identifier: 22568
2014-01-25 02:03:38,775: [18][session 36][step 176][watch:stdout] Not matching line: 'INFO Fetching \\u2018http://127.0.0.1:8898/robots.txt\\u2019.'
2014-01-25 02:03:38,809: [18][session 36][step 200][tcp_server:(localhost):8898] Accept client
2014-01-25 02:03:38,809: [18][session 36][step 200][tcp_server:(localhost):8898] New client: <ServerClient (host 127.0.0.1, port 53784)>
2014-01-25 02:03:38,812: [18][session 36][step 201][tcp_server:(localhost):8898] Read data from <ServerClient (host 127.0.0.1, port 53784)>
2014-01-25 02:03:38,822: [18][session 36][step 201][tcp_server:(localhost):8898] Error 404: 'robots.txt'
2014-01-25 02:03:38,823: [18][session 36][step 201][tcp_server:(localhost):8898] mangle choice: 1
2014-01-25 02:03:38,823: [18][session 36][step 201][tcp_server:(localhost):8898] Mangled data: bytearray(b'HTTP/1.0 404 N\x99t \x0eounz\r\nServer\x00\x80F\x8bsil\r\nPra\x19M\xfe: no-\xfe\xff\xff\xffe\r\x80\x00onten\xcf-pype: text/html\r\nConten=-Length: 0\r\n\r\n')
2014-01-25 02:03:38,823: [18][session 36][step 201][net_client:127.0.0.1:53784] Close socket
2014-01-25 02:03:38,823: [18][session 36][step 201][tcp_server:(localhost):8898] Client closed: <ServerClient (host 127.0.0.1, port 53784)>
2014-01-25 02:03:38,830: [18][session 36][step 203][watch:stdout] Match pattern 'exception' (score 100.0%) in 'Requesting http://127.0.0.1:8898/robots.txt... ERROR Fatal exception.'
2014-01-25 02:03:38,832: [18][session 36][step 204][session 36] - <WatchStdout 'watch:stdout'> score: 100.0%
2014-01-25 02:03:38,835: [18][session 36][step 205][project] End of session: score=100.0%, duration=0.314 second
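
The mangled status line in session.log contains the byte 0x99 in the reason phrase, which is what makes the UTF-8 decode fail. A minimal sketch of a more tolerant parser follows, assuming the bytes are decoded as Latin-1 instead. This `parse_status_line` is an illustrative rewrite with an assumed regular expression, not the function in wpull/http.py.

```python
import re

# Assumed pattern: version, three-digit status code, optional reason phrase.
STATUS_LINE_RE = re.compile(
    br'(HTTP/\d+\.\d+)[ \t]+(\d{3})[ \t]*([^\r\n]*)'
)


def parse_status_line(data):
    """Parse a raw status line into (version, status code, reason).

    Latin-1 maps every byte to a character, so garbage bytes in the
    reason phrase cannot raise UnicodeDecodeError.
    """
    match = STATUS_LINE_RE.match(data)
    if not match:
        raise ValueError('Malformed status line')
    version, code, reason = match.groups()
    return (version.decode('latin-1'), int(code), reason.decode('latin-1'))


if __name__ == '__main__':
    print(parse_status_line(b'HTTP/1.0 404 N\x99t \x0eounz\r\n'))
```

Input that does not match the pattern still raises, but as a `ValueError` that the caller can treat as a protocol error rather than a fatal exception.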

Lua scripting support

Allow for Lua scripting support similar to github.com/ArchiveTeam/wget-lua. The callback parameters likely cannot be the same, but it should offer the same functionality. github.com/bastibe/lunatic-python looks promising.
