nla / httrack2warc

Converts HTTrack crawls to WARC files

License: Apache License 2.0

Java 99.68% Shell 0.32%

web-archiving

httrack2warc's Issues

Exception thrown on crawls without --debug-headers

Exception in thread "main" java.nio.file.NoSuchFileException: .../hts-ioinfo.txt
	...
	at java.nio.file.Files.newInputStream(Files.java:152)
	at au.gov.nla.httrack2warc.httrack.HttrackCrawl.parseIoinfo(HttrackCrawl.java:50)
	at au.gov.nla.httrack2warc.httrack.HttrackCrawl.<init>(HttrackCrawl.java:46)
	at au.gov.nla.httrack2warc.Httrack2Warc.convert(Httrack2Warc.java:71)
	at au.gov.nla.httrack2warc.Main.main(Main.java:103)
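
One way to avoid this exception would be to treat hts-ioinfo.txt as optional. The sketch below is not the project's actual code: `IoinfoLoader` and `readIfPresent` are hypothetical names, and it assumes that an empty list of log lines is an acceptable stand-in when the crawl was run without --debug-headers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of httrack2warc.
final class IoinfoLoader {
    /**
     * Returns the lines of hts-ioinfo.txt in the crawl directory,
     * or an empty list when the file does not exist.
     */
    static List<String> readIfPresent(Path crawlDir) {
        Path ioinfo = crawlDir.resolve("hts-ioinfo.txt");
        if (!Files.exists(ioinfo)) {
            // Crawl was presumably run without --debug-headers.
            return Collections.emptyList();
        }
        try {
            // ISO-8859-1 maps every byte to a char, so decoding cannot fail.
            return Files.readAllLines(ioinfo, StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller could then fall back to the cache metadata alone whenever the returned list is empty.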

Remove Transfer-Encoding header

Even when we have the headers from the HTTrack debug log, we don't have the original transfer-encoded bytes of the response message. We should therefore remove the Transfer-Encoding header before writing the WARC, because a WARC file is supposed to contain the encoded response as it was on the wire, and the body we write is no longer in that encoding.
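
A minimal sketch of the header filtering, assuming the response headers are available as a list of raw lines with the status line first. `TransferEncodingFilter` and `strip` are hypothetical names, and folded (multi-line) header values are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of httrack2warc.
final class TransferEncodingFilter {
    /** Returns a copy of the header lines without any Transfer-Encoding field. */
    static List<String> strip(List<String> headerLines) {
        List<String> kept = new ArrayList<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            // Lines without a colon (such as the status line) have no field name.
            String name = colon < 0
                    ? ""
                    : line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            // Header field names are case-insensitive, so compare in lower case.
            if (!name.equals("transfer-encoding")) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```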

Unexpected character

I just tried to view a WARC file with OpenWayback and it output the following. Is this a problem with the WARC file itself or with httrack2warc?

WARNING: Bad Record. Trying skip (Record start 782): Unexpected character 41(Expecting d)
Mar 02, 2020 10:47:38 AM org.archive.wayback.resourcestore.indexer.IndexWorker doWork
SEVERE: FAILED to index or upload (crawl.warc)
java.lang.RuntimeException: After retry (Offset 782)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:512)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.resourceindex.updater.IndexClient.addSearchResults(IndexClient.java:158)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.doWork(IndexWorker.java:111)
	at org.archive.wayback.resourcestore.indexer.IndexWorker$WorkerThread.run(IndexWorker.java:244)
Caused by: java.io.IOException: Unexpected character 43(Expecting d)
	at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
	at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
	at org.archive.io.ArchiveReader.get(ArchiveReader.java:144)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.innerNext(ArchiveReader.java:562)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:537)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:505)
	... 9 more

The command I used to download the website:

httrack "https://web.archive.org/web/20180611033123/https://github.com/adlio/usgs-waterdata/tree-commit/89c97a80cdd6fba90972fd137fcd5a7a92ad1fff" '-*' '+https://web.archive.org/web/20180611033123*' '+https://archive.org/includes*' '+https://web.archive.org/_static*' '+https://archive.org/images*' '+https://archive.org/services*' '+https://archive.org/components*' '+https://www.archiveteam.org*' -N1005 --advanced-progressinfo --can-go-up-and-down --display --keep-alive --mirror --robots=0 --user-agent='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' --verbose

The command I used to create the warc:

java -jar /Users/fabiansturm/Documents/projects/httrack2warc/target/httrack2warc-0.4.0-shaded.jar /var/folders/rw/35q09zqj5yv5pz4wwg3yjfkm0000gn/T/webcache-download731331670 -o /var/folders/rw/35q09zqj5yv5pz4wwg3yjfkm0000gn/T/http2warc115706301 -C none

After that I renamed it to crawl.warc since I used -C none.

To run the container:

docker pull iipc/openwayback
docker container run -it --rm -v /tmp/owb:/data -p 8089:8080 iipc/openwayback

Non-200 status code handling

In 3.49-2 we have:

hts-cache/new.txt:11:21:41	185/185	---M--	301	error ('Moved%20Permanently')	text/html	date:Tue,%2009%20Jan%202018%2002:21:41%20GMT	http://test.example.org/redirect	test.example.org/redirect	(from http://test.example.org/)
Binary file hts-cache/new.zip matches
hts-ioinfo.txt:[1] request for test.example.org/redirect:
hts-ioinfo.txt:<<< GET /redirect HTTP/1.1
hts-ioinfo.txt:[1] response for test.example.org/redirect:

The comment of the corresponding new.zip entry contains:

HTTP/1.1 301 Moved Permanently
X-In-Cache: 1
X-StatusCode: 301
X-StatusMessage: Moved Permanently
X-Size: 185
Content-Type: text/html
Last-Modified: Tue, 09 Jan 2018 02:21:41 GMT
Location: http://test.example.org/another
X-Addr: test.example.org
X-Fil: /redirect
X-Save: test.example.org/redirect

These are converted correctly if hts-ioinfo.txt is present. Without hts-ioinfo.txt, however, a resource record is currently created instead.

I don't think a cache entry is present at all in early versions of HTTrack. It might be possible to recreate redirects from the log messages though.
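
Where a cache entry does exist, its header block carries enough information to recognise a redirect without hts-ioinfo.txt. The following is only a sketch under that assumption: `CacheEntryHeaders` is a hypothetical class that parses a block shaped like the new.zip comment shown above, and it says nothing about how the project actually reads the cache.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical parser for a header block like the new.zip comment above.
final class CacheEntryHeaders {
    final int status;
    // Field names are stored in lower case; a repeated field keeps its last value.
    final Map<String, String> fields = new LinkedHashMap<>();

    CacheEntryHeaders(String block) {
        String[] lines = block.split("\r?\n");
        // The status line looks like "HTTP/1.1 301 Moved Permanently".
        status = Integer.parseInt(lines[0].split(" ", 3)[1]);
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
                fields.put(name, lines[i].substring(colon + 1).trim());
            }
        }
    }

    /** True for a 3xx status that also carries a Location field. */
    boolean isRedirect() {
        return status >= 300 && status < 400 && fields.containsKey("location");
    }
}
```

With something like this, an entry for which `isRedirect()` is true could be written as a response record rather than a resource record.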

Escaping in new.txt and new.zip do not match

HTTrack appears to write the URL escaped in new.txt (e.g. spaces replaced with %20) but unescaped in new.zip. This causes a cache lookup error when the two forms do not match:

Exception in thread "main" java.io.IOException: no cache entry: http://example.org/some%20file.jpg
    at au.gov.nla.httrack2warc.httrack.HttrackCrawl.buildRecord(HttrackCrawl.java:148)

When writing new.txt entries, HTTrack appears to escape the following characters:

spaces
double-quotes
character codes <= 31
character codes >= 127

Notably this does not include the % character. Therefore this transformation is not safely reversible.
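
Because unescaping is ambiguous, one option is to go the other way and apply the same escaping to the new.zip form before comparing it with new.txt. The sketch below implements only the four rules listed above. `NewTxtEscaper` is a hypothetical name, and both the UTF-8 byte encoding and the upper-case hex digits are assumptions, not confirmed HTTrack behaviour.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of httrack2warc.
final class NewTxtEscaper {
    /**
     * Percent-escapes spaces, double quotes, codes <= 31 and codes >= 127.
     * The '%' character is deliberately left untouched, as observed above.
     */
    static String escape(String url) {
        StringBuilder out = new StringBuilder();
        // Assumption: the URL is escaped byte by byte in UTF-8.
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c <= 31 || c >= 127 || c == ' ' || c == '"') {
                // Assumption: hex digits are written in upper case.
                out.append('%').append(String.format("%02X", c));
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }
}
```

Escaping the new.zip URL this way gives a key that can be compared directly with the new.txt URL, which sidesteps the irreversible direction.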

Handle image errors renamed to .html

Requests for URLs with an image file extension (e.g. foo.gif) might return an HTML 404 error message. In this case HTTrack appears to write the error message to a file named foo.html but still refers to it as foo.gif in the cache and in new.txt.

I've worked around this for now by allowing missing files to be skipped when they have an HTTP error status code. Is there a way we can detect and handle this case properly? Maybe we can implement the same conditions HTTrack uses for renaming the files and probe for their existence.
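
The probing idea could look roughly like the sketch below. It covers only the single case described here, an extension swapped for .html, and does not reproduce HTTrack's real renaming conditions, which are not stated in this issue. `RenamedErrorProbe` and `locate` are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not part of httrack2warc.
final class RenamedErrorProbe {
    /**
     * Returns the expected file if it exists, otherwise a sibling with the
     * same stem and an .html extension if that exists, otherwise empty.
     */
    static Optional<Path> locate(Path expected) {
        if (Files.exists(expected)) {
            return Optional.of(expected);
        }
        String name = expected.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // foo.gif -> foo; names without an extension are used as-is.
        String stem = dot > 0 ? name.substring(0, dot) : name;
        Path candidate = expected.resolveSibling(stem + ".html");
        return Files.exists(candidate) ? Optional.of(candidate) : Optional.empty();
    }
}
```

The fallback is probably best restricted to entries whose cached status code indicates an error, so that a genuinely missing image is not silently replaced by an unrelated HTML page.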

HTTrack 3.01 support

different hts-ioinfo.txt format
split logs structure
older ndx cache format
