machawk1 / warcreate
Chrome extension to "Create WARC files from any webpage"
Home Page: https://warcreate.com
License: MIT License
This is necessary for moving the logic from WARC creation time to onload.
This might have to do with
An example can be seen in the doctype declaration in the produced WARC.
The method used to capture CSS can probably be used here.
The host does not appear to be prepended, but the path portion is included in the WARC record.
An example:
WARC/1.0
WARC-Type: request
WARC-Target-URI: _css/azul_movil.css
WARC-Date: 2013-03-20T17:58:32.560Z
WARC-Concurrent-To: urn:uuid:02d04100-1a67-a427-559c-24f7889dd8ce
WARC-Record-ID: urn:uuid:72dff350-fd0d-95b2-a778-1ed78556a0ce
Content-Type: application/http; msgtype=request
Content-Length: 325
GET /s HTTP/1.1
Host: undefined
Connection: close
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.152 Safari/537.22
Accept-Encoding: gzip
Accept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7
Cache-Control: no-cache
Accept-Language: de,en;q=0.7,en-us;q=0.3
WARC/1.0
WARC-Type: response
WARC-Target-URI: _css/azul_movil.css
WARC-Date: 2013-03-20T17:58:32.560Z
WARC-Record-ID: urn:uuid:6ddadec2-be01-8694-304e-62a3d37ac3fa
Content-Type: application/http; msgtype=response
Content-Length: 604
HTTP/1.1 200 OK
Content-Type: text/css
Date: Wed Mar 20 2013 13:58:32 GMT-0400 (EDT) GMT
Last-Modified: Wed Mar 20 2013 13:58:32 GMT-0400 (EDT) GMT
Server: Apache/2.2.17 (Unix) PHP/5.3.5 mod_ssl/2.2.17 OpenSSL/0.9.8q
Accept-Ranges: bytes
Content-Type: text/css
The requested URL /getThatText.php was not found on this server.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.
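The relative `WARC-Target-URI` and the `Host: undefined` header in the record above could both be avoided by resolving each resource reference against the page URL before writing the record. The following is a minimal sketch using the standard `URL` constructor; the function names and the page URL are hypothetical and are not taken from the extension's code.

```javascript
// Resolve a (possibly relative) resource reference against the URL of the
// page being archived, so that WARC-Target-URI is always absolute.
function resolveTargetUri(reference, pageUrl) {
  return new URL(reference, pageUrl).href;
}

// Derive the value for the request's Host header from the absolute URI.
function hostHeaderFor(absoluteUri) {
  return new URL(absoluteUri).host;
}

// Hypothetical page URL: the real one is not given in the report above.
const pageUrl = "http://example.com/site/index.html";
const target = resolveTargetUri("_css/azul_movil.css", pageUrl);

console.log("WARC-Target-URI: " + target);
console.log("Host: " + hostHeaderFor(target));
```

References that are already absolute pass through unchanged, so the same helper could be applied to every captured resource.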
Currently, asynchronous requests from other tabs pollute the WARC if they occur during the WARC creation process.
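One way to keep other tabs' traffic out of the WARC is to filter on the tab ID: the `details` object passed to `chrome.webRequest` event listeners carries a `tabId` field. The sketch below shows only the filtering step on plain objects; `makeTabFilter` and the sample data are hypothetical, and wiring it into the extension's listeners is left out.

```javascript
// Build a predicate that keeps only requests issued by the tab being archived.
// `details` is assumed to have the shape of a chrome.webRequest event payload,
// of which only the tabId field is used here.
function makeTabFilter(archivedTabId) {
  return function (details) {
    return details.tabId === archivedTabId;
  };
}

// Hypothetical sample of requests observed while the WARC was being created.
const observed = [
  { tabId: 7, url: "http://example.com/" },
  { tabId: 9, url: "http://other.example/poll" },
  { tabId: 7, url: "http://example.com/_css/site.css" },
];

const kept = observed.filter(makeTabFilter(7));
console.log(kept.map(function (request) { return request.url; }));
```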
For example, in MediaWiki, the ’ character is saved as a character with hex value 92.
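Byte 0x92 is the Windows-1252 encoding of the right single quotation mark (U+2019), which suggests the payload is being written in a single-byte encoding. Assuming the intent is to store page text as UTF-8, the sketch below shows the byte sequence that would be expected instead. `utf8Bytes` and `toHex` are hypothetical helper names; `TextEncoder` is the standard API and always produces UTF-8.

```javascript
// TextEncoder always encodes to UTF-8, regardless of the page's charset.
const encoder = new TextEncoder();

function utf8Bytes(text) {
  return encoder.encode(text);
}

// Render bytes as space-separated two-digit hex for inspection.
function toHex(bytes) {
  return Array.from(bytes, function (b) {
    return b.toString(16).padStart(2, "0");
  }).join(" ");
}

const quote = "\u2019"; // right single quotation mark
console.log(toHex(utf8Bytes(quote)));

// Content-Length must count bytes, not characters.
console.log(utf8Bytes("it\u2019s").length);
```

Counting bytes rather than characters also matters for the `Content-Length` header, since any non-ASCII character makes the two differ.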
nasa.gov
cnn.com
For example, if I go to http://matkelly.com/_images/logo.png and choose "Create WARC", the WARC created contains an HTML page with the image, likely generated by the browser to display the image. Further, the binary data is not present.
This means the JS versions are not correct.
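Preserving the binary data requires that the payload never pass through a string. A minimal sketch, assuming the image bytes are already available as a `Uint8Array` (in a browser they might come from an `XMLHttpRequest` with `responseType` set to `"arraybuffer"`); `buildRecordBytes` is a hypothetical helper, not part of the extension.

```javascript
// Join a textual header block and a binary payload into one byte array,
// without converting the payload to a string at any point.
function buildRecordBytes(headerText, payload) {
  const header = new TextEncoder().encode(headerText);
  const record = new Uint8Array(header.length + payload.length);
  record.set(header, 0);
  record.set(payload, header.length);
  return record;
}

// The 8-byte PNG signature stands in for real image data.
const pngSignature = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const record = buildRecordBytes("Content-Type: image/png\r\n\r\n", pngSignature);

console.log(record.length);
```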
Thanks for the discovery, @nlevitt
Currently, there is little more than inline documentation.
It's questionable whether this should even be included, but for the sake of comprehensive preservation maybe it should be fetched.
The faux naming is a remnant of the reconstructions for image data acquisition and encoding in the WARC file.
This was only applicable to images that required a subsequent Ajax call, e.g., embedded in CSS.
Fixing this would probably fix a few other issues down the line.
Rely on WARC data contained within the extension (or on the website) with some diff threshold to allow for (e.g.) date discrepancies.
A la the Ghostery plugin. This will give folks an intro to what it is and initial options to set.
Possibly implement this in a contextual dropdown option, which would allow filtering based on the current tab's context.
As suggested by Noah Levitt @ Internet Archive.
See bug #14.
Status code and status message are separate variables. It doesn't look like the true HTTP status line returned exists in the jsxhr object, so I've had to add HTTP/1.1 manually, which might not always be the actual protocol version.
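The workaround described here can be sketched as a small helper that rebuilds the status line from the `status` and `statusText` properties that XHR-style objects expose. `buildStatusLine` is a hypothetical name, and the default protocol version is exactly the assumption the issue warns about.

```javascript
// Rebuild an HTTP status line from an XHR-like object.
// The protocol version is not exposed by the object, so it is a parameter
// with an assumed default of HTTP/1.1.
function buildStatusLine(xhr, httpVersion) {
  const version = httpVersion || "HTTP/1.1";
  return version + " " + xhr.status + " " + xhr.statusText;
}

// Plain objects stand in for real response objects.
console.log(buildStatusLine({ status: 200, statusText: "OK" }));
console.log(buildStatusLine({ status: 404, statusText: "Not Found" }, "HTTP/1.0"));
```

Keeping the version as a parameter means it can be replaced by the real value if a source for it is found later.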
As suggested by Noah Levitt @ Internet Archive.
This would be handy for creating revisit records.
http://updates.html5rocks.com/2012/06/Don-t-Build-Blobs-Construct-Them
BlobBuilder was deprecated in a revision of the HTML5 File API. Move from BlobBuilder to the Blob constructor.
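The migration amounts to passing the parts to the `Blob` constructor as an array instead of appending them one by one. A minimal sketch follows; the sample parts are placeholders, and the `application/warc` type is an assumption about what the extension would want to set.

```javascript
// Old style (deprecated), shown for comparison only:
//   var bb = new BlobBuilder();
//   bb.append(part1);
//   bb.append(part2);
//   var blob = bb.getBlob("application/warc");

// New style: pass all parts to the Blob constructor at once.
// Parts may be strings, ArrayBuffers, typed arrays, or other Blobs.
const parts = ["WARC/1.0\r\n", "WARC-Type: warcinfo\r\n", "\r\n"];
const blob = new Blob(parts, { type: "application/warc" });

console.log(blob.size);
console.log(blob.type);
```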