iipc / openwayback
The OpenWayback Development
Home Page: http://www.netpreserve.org/openwayback
License: Apache License 2.0
Taken from document's "Schedule for next steps", 2.0.0-Beta-1.
All outstanding issues to be copied from the issues spreadsheet.
There are probably some steps from the initial document which need to be logged too.
I've been attempting to re-use the StaticMapExclusionFilterFactory code elsewhere, and although it says it can cope with SURT prefixes, I'm finding that they fail to load. The problem is that the system appears to catch the wrong exception.
This says it captures URIException, but I'm seeing them get through:
org.apache.commons.httpclient.URIException: gnu.inet.encoding.IDNAException: Contains non-LDH characters. (org,
It would be interesting if the wayback machine provided ways to do visual diffs on webpages via resemble.js and perhaps html diffs as well.
Use CDX server for this. Documentation required.
S'sheet line: 14
For whom? BNF
Notes: Error messages don't fit either when frames are small.
Est. Milestone: 2.x.x
Live Archiving Proxy integration, requires SSL handling.
S'sheet line: 26
For whom? INA, BL, IA
Notes: Separate project, integrating LAP and Proxy-mode Wayback? e.g. forward request to LAP? Largely a documentation issue.
Est. Milestone: 2.0.x
Support WARC revisit, inc. URL-agnostic.
S'sheet line: 4
For whom? BNF, BL, DN
Notes: Revisits fixed. URL-agnostic working but at edge of spec?
Est. Milestone: 2.x.x
S'sheet line: 29
For whom? BNF, NULI, BL, DN
Notes: There are holes, but framework ok. How to review? Consider switch to JSTL?
Est. Milestone: 2.0.x
S'sheet line: 7
Notes: Should work now?
Est. Milestone: Ilya to check
S'sheet line: 8
For whom? BNF, BL, DN
Notes: Same problem as for LAP. Documentation challenge. Look at LAP certificate magic.
Est. Milestone: 2.x.x
Better display of information panel/toolbar into frames and iframes.
S'sheet line: 13
For whom? BNF
Notes: Wayback toolbar multiple times, one in each frame. No scaling/frame-sense. At least the date in small frames. Perhaps test-suite driven solution?
Est. Milestone: 2.x.x
The goal is to reduce linkage to IA Maven artefacts, so that we can build using a Sonatype POM and make clean releases. This should also enable Travis CI builds to work, and ensure that proper releases go to Maven Central.
S'sheet line: 27
For whom? BNF, NULI, BL, DN, IA
Notes: Document overlay stuff, keep UI stuff together/modular.
Est. Milestone: TBC
The code is critically dependent on the heritrix-commons codebase, mainly for the WARC readers/writers. API changes between 3.1.1 and 3.1.2-SNAPSHOT mean that we cannot rely on a proper release at the moment.
This is really rather bad practice for code stability, as the code we depend on can shift from underneath us, and this is even more pressing as the two codebases are now under the control of different groups.
Easiest solution is probably for IA to make a 3.1.2 release. Best solution is probably to pull the ARC/WARC code out into a separate project AND/OR shift over to the JWAT implementation. Clumsy fallback would be to make an IIPC release of H3 and depend on that instead.
S'sheet line: 10
For whom? BNF, DN
Notes: CDX/indexing consequences? Need a test case. Heritrix issues, maybe just H1, so need H1 and H3 test cases.
Est. Milestone: 2.x.x
PrefixFieldCollapser and RegexFieldMatcher are missing when compiling CDXServer.java. Any chance of adding these classes to the repo? Thanks.
Pool examples? Plugin module? Host-based rules, over time. Put together with canonicalisation rules topic.
Display CDX metadata, e.g. WARC filename.
S'sheet line: 11
For whom? BNF, DN
Notes: Option to show somewhere in UI.
Est. Milestone: 2.x.x
Dynamic JSP inserts (based on host).
S'sheet line: 24
For whom? BNF
Notes: Currently implemented in JavaScript.
Est. Milestone: 2.x.x
S'sheet line: 19
For whom? BNF
Notes: Currently, BDB IP hooked to timestamp. Requires wayback changes? Browser extension?
Est. Milestone: TBC
Documentation/tutorial example?
S'sheet line: 30
For whom? NULI, BL
Notes: Use CDX server for this. Documentation required.
Est. Milestone: 2.0.x
Scale: # hits on host, aggregate and clustering results.
S'sheet line: 2
For whom? BNF, BL, DN, IA
Notes: New CDX server should enable this.
Est. Milestone: 2.x.x
S'sheet line: 25
For whom? NLN
Notes: Fixed recently? Test suite? in Heritrix Commons. Probably only GZIP sniffing.
S'sheet line: 6
For whom? BNF, BL, DN
Notes: CDX/indexing consequences.
Est. Milestone: Ilya to check
S'sheet line: 28
For whom? ALL
Est. Milestone: 2.0.x
Document canonicalisation rules - current state of play.
S'sheet line: 23
For whom? ALL
Est. Milestone: 2.0.x
Clarification needed? IDN?
S'sheet line: 21
For whom? BNF
Notes: Supported; requires documentation.
Est. Milestone: 2.0.x
Hello, is it possible to add an example of livewebPrefix to wayback.xml?
This one did not work for me:
After:
<property name="replayPrefix" value="${wayback.urlprefix}" />
<property name="queryPrefix" value="${wayback.urlprefix}" />
<property name="staticPrefix" value="${wayback.urlprefix}" />
Added:
<property name="livewebPrefix" value="${wayback.urlprefix}/liveweb/" />
According to the documentation (http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html) and AccessPoint.java (https://github.com/internetarchive/wayback/blob/master/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java), I need to add it.
S'sheet line: 15
For whom? BNF, IA
Notes: Download current page as (W)ARC. Multiple pages, e.g. every page visited in session, c.f. Zotero. Separate component that talks to the proxy? Solution not clear.
Est. Milestone: TBC
S'sheet line: 5
For whom? BNF, BL, DN
Notes: CDX/indexing consequences
Est. Milestone: Ilya to check.
Displaying a large number of results (UI).
S'sheet line: 3
For whom? BNF, BL, DN, IA
Est. Milestone: 2.x.x
S'sheet line: 12
For whom? IA
Notes: Needs architectural discussion? e.g. browsing inlinks. Common UI changes?
Est. Milestone: TBC
Decide on need to aggregate stats and then do some UI design.
Documentation exercise? The FTP examples were probably resource records. CDX and custom UI renderer required? Test case needed.
For whois records, WARCRecordToSearchResultAdapter.java needs a new case added to handle resource records.
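As a rough illustration of what such a new case might look like, here is a minimal, self-contained sketch that dispatches on the WARC-Type header. It is not the actual WARCRecordToSearchResultAdapter code: the class RecordTypeDispatch, the IndexEntry type, the "-" status placeholder, the "unk" fallback and the whois URL are all hypothetical. The only facts relied on are that WARC records carry WARC-Type, WARC-Target-URI and Content-Type headers, and that a resource record has no HTTP status line to parse.

```java
import java.util.HashMap;
import java.util.Map;

public class RecordTypeDispatch {

    /** Hypothetical, simplified index entry; not OpenWayback's real result class. */
    static final class IndexEntry {
        final String url;
        final String mimeType;
        final String httpStatus;

        IndexEntry(String url, String mimeType, String httpStatus) {
            this.url = url;
            this.mimeType = mimeType;
            this.httpStatus = httpStatus;
        }
    }

    /** Returns an index entry for the record, or null if this sketch does not handle its type. */
    static IndexEntry adapt(Map<String, String> warcHeaders) {
        String type = warcHeaders.get("WARC-Type");
        String url = warcHeaders.get("WARC-Target-URI");
        if (type == null || url == null) {
            return null;
        }
        switch (type) {
            case "resource": {
                // A resource record has no HTTP envelope, so there is no status line
                // to parse. Take the MIME type from the WARC record's own Content-Type
                // header; "-" as the status value is an assumption of this sketch.
                String mime = warcHeaders.get("Content-Type");
                return new IndexEntry(url, mime != null ? mime : "unk", "-");
            }
            default:
                // response, revisit, etc. would stay on the existing code paths.
                return null;
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("WARC-Type", "resource");
        headers.put("WARC-Target-URI", "whois://whois.example.org/example.org");
        headers.put("Content-Type", "text/plain");
        IndexEntry entry = adapt(headers);
        System.out.println(entry.url + " " + entry.mimeType + " " + entry.httpStatus);
    }
}
```

Keeping the resource branch separate from the HTTP response branch means the indexer never attempts to read a status line from a payload that does not have one.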
(Ilya, see https://webarchive.jira.com/browse/ARI-3552)
/data/arcs/archive-it/ARCHIVEIT-4000-NONE-25860-20131018173129550-00000-wbgrp-crawl055.us.archive.org-6444.warc.gz
java.io.IOException: Failed parse of http status line.
at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:273)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:105)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:74)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:52)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:54)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:52)
at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:209)
java.io.IOException: Failed parse of http status line.
at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:273)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:105)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:74)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:52)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:54)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:52)
at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:209)
java.io.IOException: Failed parse of http status line.
at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:273)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:105)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:74)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:52)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:54)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:52)
at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:209)
S'sheet line: 17
For whom? BNF
Notes: Memento in Proxy Mode?
Est. Milestone: TBC
Dynamic collections under the same replay point. Proxy is fixed, so switching collections is needed, added to the session.
S'sheet line: 20
For whom? BNF
Notes: Related case for Archive-IT using HTTP Basic. Perhaps an API consideration.
Est. Milestone: TBC
S'sheet line: 18
For whom? BNF
Notes: Requires better documentation?
Est. Milestone: 2.0.x
Related to #25.
S'sheet line: 9
Notes: Easier, but need non HTTPS entry point. How to enable?
Est. Milestone: DONE?
As was discussed over on SourceForge, Wayback's use of jQuery appears to be causing some problems: it loads an older version (v1.3.2), which can stomp on previously loaded versions and break functionality on the page. I think it may also interfere with any jQuery plugins that have already been installed.
I don’t know what the best solution is, but it seems to me there are (at least) four options:
Personally, I think 4 is probably the best option, to keep things as simple as possible. But it's not entirely clear to me why jQuery was added, or what it and the associated plugins are currently used for.
Move to Spring?
While porting for #10, this happened:
One issue I noticed was that the archive-access code brings in entire heritrix-commons just for one class, which appears to be quite general purpose:
import org.archive.net.PublicSuffixes;
(indeed, there is a Google Guava class that does pretty much the same thing). This seems a little over the top, so I copied the PublicSuffixes to iipc-web-commons under the org.archive.url package, along with the corresponding unit tests and effective_tld data file.
This is rather clumsy, and given this is provided by Google Guava, there seems little point maintaining our own code (assuming theirs is kept up to date). The task is then to check that the Google one is well maintained and switch over to that instead of copying in code from elsewhere.
Hi all,
The ARC indexer throws an exception like "Created (escaped) uuri > 2083" and the indexing process stops. How can I solve this problem?
Thanks all
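One possible workaround, sketched here only as an illustration, is to skip over-long URLs instead of letting a single one abort the whole run. The class LongUrlFilter and its methods are hypothetical and not part of the indexer. The 2083 limit is taken from the error message above, and this sketch measures the raw string length, whereas the message refers to the escaped form, so a real check would need to escape the URL first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LongUrlFilter {

    // Assumption: mirrors the limit quoted in the error message.
    static final int MAX_URL_LENGTH = 2083;

    /** True if the URL is non-null and no longer than the assumed limit. */
    static boolean isIndexable(String url) {
        return url != null && url.length() <= MAX_URL_LENGTH;
    }

    /** Returns only the URLs that pass the length check, preserving order. */
    static List<String> keepIndexable(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (isIndexable(url)) {
                kept.add(url);
            }
            // Over-long URLs are dropped here; a real indexer would probably log them.
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build a URL that is deliberately longer than the limit.
        StringBuilder tooLong = new StringBuilder("http://example.org/?q=");
        while (tooLong.length() <= MAX_URL_LENGTH) {
            tooLong.append('a');
        }
        List<String> urls = Arrays.asList("http://example.org/", tooLong.toString());
        System.out.println(keepIndexable(urls).size() + " of " + urls.size() + " URLs kept");
    }
}
```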
Needs review for best way forward. Suggestion that The Oracle is not to be trusted (too old). So need to collect use cases first.
S'sheet line: 22
For whom? BNF
Notes: e.g. hash-bang URIs, cross-component issues. Share examples first? API consideration. Shared rule-bank, versioned, between H and W. Perhaps a shared rule framework with local rules. Look at the two Wayback canonicalisation systems.
Est. Milestone: TBC
S'sheet line: 16
For whom? BNF
Est. Milestone: 2.x.x