httrack2warc's People

Contributors

ato, dependabot[bot], jlleitschuh

httrack2warc's Issues

Exception thrown on crawls without --debug-headers

Exception in thread "main" java.nio.file.NoSuchFileException: .../hts-ioinfo.txt
	...
	at java.nio.file.Files.newInputStream(Files.java:152)
	at au.gov.nla.httrack2warc.httrack.HttrackCrawl.parseIoinfo(HttrackCrawl.java:50)
	at au.gov.nla.httrack2warc.httrack.HttrackCrawl.<init>(HttrackCrawl.java:46)
	at au.gov.nla.httrack2warc.Httrack2Warc.convert(Httrack2Warc.java:71)
	at au.gov.nla.httrack2warc.Main.main(Main.java:103)
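
One possible guard, sketched under assumptions: the stack trace suggests the constructor opens hts-ioinfo.txt unconditionally, so checking for the file first would let crawls made without --debug-headers proceed. The class and method names below (IoinfoGuard, readIoinfoIfPresent) are hypothetical and are not part of httrack2warc.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

class IoinfoGuard {
    // Hypothetical helper: returns the lines of hts-ioinfo.txt, or an empty
    // list when the file is absent (e.g. a crawl run without --debug-headers).
    static List<String> readIoinfoIfPresent(Path crawlDir) throws IOException {
        Path ioinfo = crawlDir.resolve("hts-ioinfo.txt");
        if (!Files.exists(ioinfo)) {
            return Collections.emptyList();
        }
        // ISO-8859-1 maps every byte to a character, so odd header bytes
        // cannot trigger a decoding exception.
        return Files.readAllLines(ioinfo, StandardCharsets.ISO_8859_1);
    }
}
```

With an empty result the caller would have to fall back to whatever header information the cache provides.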

Remove Transfer-Encoding header

Even when we have the headers from the HTTrack debug log, we don't have the original transfer-encoded bytes of the response message. The WARC file is supposed to contain the encoded response as it was on the wire, so we should remove the Transfer-Encoding header before writing the WARC; otherwise the header describes an encoding that the stored body no longer has.
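
A minimal sketch of the header removal, assuming the response headers are available as a list of raw lines with the status line first. HeaderFilter and stripTransferEncoding are hypothetical names, and folded (multi-line) header values are not handled.

```java
import java.util.ArrayList;
import java.util.List;

class HeaderFilter {
    // Drops any Transfer-Encoding header line (name match is case-insensitive)
    // and keeps the status line and all other headers in their original order.
    static List<String> stripTransferEncoding(List<String> headerLines) {
        List<String> kept = new ArrayList<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            String name = colon < 0 ? "" : line.substring(0, colon).trim();
            if (!name.equalsIgnoreCase("Transfer-Encoding")) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```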

Unexpected character

I just tried to view a WARC file with OpenWayback and it output the following. Is this a problem with the WARC or with httrack2warc?

WARNING: Bad Record. Trying skip (Record start 782): Unexpected character 41(Expecting d)
Mar 02, 2020 10:47:38 AM org.archive.wayback.resourcestore.indexer.IndexWorker doWork
SEVERE: FAILED to index or upload (crawl.warc)
java.lang.RuntimeException: After retry (Offset 782)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:512)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.resourceindex.updater.IndexClient.addSearchResults(IndexClient.java:158)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.doWork(IndexWorker.java:111)
	at org.archive.wayback.resourcestore.indexer.IndexWorker$WorkerThread.run(IndexWorker.java:244)
Caused by: java.io.IOException: Unexpected character 43(Expecting d)
	at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
	at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
	at org.archive.io.ArchiveReader.get(ArchiveReader.java:144)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.innerNext(ArchiveReader.java:562)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:537)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:505)
	... 9 more

The command I used to download the website:

httrack "https://web.archive.org/web/20180611033123/https://github.com/adlio/usgs-waterdata/tree-commit/89c97a80cdd6fba90972fd137fcd5a7a92ad1fff" '-*' '+https://web.archive.org/web/20180611033123*' '+https://archive.org/includes*' '+https://web.archive.org/_static*' '+https://archive.org/images*' '+https://archive.org/services*' '+https://archive.org/components*' '+https://www.archiveteam.org*' -N1005 --advanced-progressinfo --can-go-up-and-down --display --keep-alive --mirror --robots=0 --user-agent='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' --verbose

The command I used to create the warc:

java -jar /Users/fabiansturm/Documents/projects/httrack2warc/target/httrack2warc-0.4.0-shaded.jar /var/folders/rw/35q09zqj5yv5pz4wwg3yjfkm0000gn/T/webcache-download731331670 -o /var/folders/rw/35q09zqj5yv5pz4wwg3yjfkm0000gn/T/http2warc115706301 -C none

After that I renamed it to crawl.warc since I used -C none.

To run the container:

docker pull iipc/openwayback
docker container run -it --rm -v /tmp/owb:/data -p 8089:8080 iipc/openwayback

Non-200 status code handling

In HTTrack 3.49-2 we have:

hts-cache/new.txt:11:21:41	185/185	---M--	301	error ('Moved%20Permanently')	text/html	date:Tue,%2009%20Jan%202018%2002:21:41%20GMT	http://test.example.org/redirect	test.example.org/redirect	(from http://test.example.org/)
Binary file hts-cache/new.zip matches
hts-ioinfo.txt:[1] request for test.example.org/redirect:
hts-ioinfo.txt:<<< GET /redirect HTTP/1.1
hts-ioinfo.txt:[1] response for test.example.org/redirect:

The new.zip comment entry has:

HTTP/1.1 301 Moved Permanently
X-In-Cache: 1
X-StatusCode: 301
X-StatusMessage: Moved Permanently
X-Size: 185
Content-Type: text/html
Last-Modified: Tue, 09 Jan 2018 02:21:41 GMT
Location: http://test.example.org/another
X-Addr: test.example.org
X-Fil: /redirect
X-Save: test.example.org/redirect

These are converted correctly if hts-ioinfo.txt is present. Without it, however, a resource record is currently created instead of a response record.
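
As a rough sketch of how the comment entry above could be used when hts-ioinfo.txt is missing, the following parses the status line and headers from the comment text. CacheComment is a hypothetical class; how httrack2warc actually reads new.zip is not shown here, and the input format is assumed to match the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CacheComment {
    final int status;
    final Map<String, String> headers = new LinkedHashMap<>();

    CacheComment(String comment) {
        String[] lines = comment.split("\r?\n");
        // The first line is assumed to look like "HTTP/1.1 301 Moved Permanently".
        status = Integer.parseInt(lines[0].split(" ", 3)[1]);
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                headers.put(lines[i].substring(0, colon).trim(),
                            lines[i].substring(colon + 1).trim());
            }
        }
    }

    // True when the entry carries a 3xx status and a Location header.
    boolean isRedirect() {
        return status >= 300 && status < 400 && headers.containsKey("Location");
    }
}
```

For the entry above this would report status 301 and the Location http://test.example.org/another, which is enough to write a redirect response rather than a resource record.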

I don't think a cache entry is present at all in early versions of HTTrack. It might be possible to recreate redirects from the log messages though.

Escaping in new.txt and new.zip does not match

HTTrack appears to write the URL escaped in new.txt (e.g. spaces replaced with %20) but unescaped in new.zip. This causes a cache lookup error when the two forms do not match:

Exception in thread "main" java.io.IOException: no cache entry: http://example.org/some%20file.jpg
    at au.gov.nla.httrack2warc.httrack.HttrackCrawl.buildRecord(HttrackCrawl.java:148)

In new.txt entries, HTTrack appears to escape the following characters:

  • spaces
  • double-quotes
  • character codes <= 31
  • character codes >= 127

Notably, this does not include the % character. A literal %20 in the original URL is therefore indistinguishable from an escaped space, so the transformation is not safely reversible.
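
The observed escaping can be approximated as below. This is a sketch of the behaviour described in this issue, not HTTrack's actual code; the UTF-8 byte encoding and the uppercase hex digits are assumptions, and NewTxtEscape is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

class NewTxtEscape {
    // Percent-encodes only the characters listed above: space, double quote,
    // codes <= 31 and codes >= 127. '%' itself is deliberately left alone.
    static String escape(String url) {
        StringBuilder out = new StringBuilder();
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xff;
            if (c == ' ' || c == '"' || c <= 31 || c >= 127) {
                out.append(String.format("%%%02X", c));
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }
}
```

Both "some file.jpg" and "some%20file.jpg" escape to "some%20file.jpg", which illustrates why the new.txt form cannot be unescaped reliably. One option would be to apply the same escaping to the new.zip entry names and compare in escaped form, rather than trying to unescape new.txt.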

Handle image errors renamed to .html

Requests for URLs with an image file extension (e.g. foo.gif) might return an HTML 404 error message. In this case HTTrack appears to write the error message to a file named foo.html but still refers to it as foo.gif in the cache and in new.txt.

I've worked around this for now by allowing missing files to be skipped when their entry has an HTTP error status code. Is there a way we can detect and handle this case properly? Maybe we can implement the same conditions HTTrack uses for renaming the files and probe for their existence.
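
A minimal sketch of the probing idea, covering only the foo.gif to foo.html case described above. HTTrack's real renaming conditions are not reproduced, and RenamedErrorProbe and locate are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;

class RenamedErrorProbe {
    // Returns the file to read for a saved path: the path itself if it exists,
    // otherwise a sibling with the extension swapped to .html, otherwise null.
    static Path locate(Path saved) {
        if (Files.exists(saved)) {
            return saved;
        }
        String name = saved.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot < 0 ? name : name.substring(0, dot);
        Path html = saved.resolveSibling(base + ".html");
        return Files.exists(html) ? html : null;
    }
}
```

A caller would probably want to apply this fallback only when the entry has an error status code, so that an unrelated foo.html is not picked up for a successful response.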