[discussion/thought] Would a custom browser solution work better in terms of capabilities/UI than most current tools/proxies? (warcreate, open issue)

hanoii avatar hanoii commented on August 22, 2024
[discussion/thought] Would a custom browser solution work better in terms of capabilities/UI than most current tools/proxies?

from warcreate.

Comments (14)

N0taN3rd avatar N0taN3rd commented on August 22, 2024 2

To put my 2 cents in, the way warcreate does things is the way to do it. There is an alternative way to do things, but it would be painful due to the limitations of the browser, and thus the contribution opportunity is still open. Similar contribution opportunities are open for the other projects mentioned here (all welcome them with open arms) if you can improve them to fit your needs, and they welcome detailed issues regarding their shortcomings.

@hanoii If you have any questions etc feel free to contact myself or @ikreymer.

hanoii avatar hanoii commented on August 22, 2024 1

What sort of other stuff? Some browsers also natively support the HAR format and @ikreymer even created a library, har2warc, to convert HAR to WARC, so there may be potential there with regard to preservation.

I need to keep, if possible, better consumable media. The WARC part is probably the default for the archiving-to-replay side of things, but it might make sense to have screenshots, annotations, tagging, and then maybe storing it somewhere. Attempting to fetch YouTube and/or Facebook videos through youtube-dl could come in handy, so I am exploring the option to do that from an Electron app, at least for now as an initial PoC tool.

For the storage side of things I think https://www.archivematica.org could potentially work.

And if I go (or suggest) the Electron route, I think some of what you did could either be reused or serve as an inspiration, and if I do use it or look more closely at it, I am sure I will have feedback on it.

hanoii avatar hanoii commented on August 22, 2024 1

@machawk1 also just saw https://github.com/N0taN3rd/node-warc from @N0taN3rd which is likely to help.

N0taN3rd avatar N0taN3rd commented on August 22, 2024 1

The alternative I was alluding to is going the route of how Squidwarc does preservation using the Chrome DevTools Protocol (CDP). Chrome extensions can use the CDP via the chrome.debugger API. The "painful" part is twofold: the limitations enforced on blobs, and the fact that preservation could not operate per tab; it would have to use a separate tab for "crawling" the pages to be preserved.
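
As a rough illustration of the CDP route (not Squidwarc's actual code), the sketch below builds the JSON commands a client would send and tracks which responses have finished loading so that their bodies can be requested. `Network.enable`, `Network.responseReceived`, `Network.loadingFinished`, and `Network.getResponseBody` are CDP names; the `CaptureSession` class, its methods, and the sample events are hypothetical, and no browser connection is made here.

```python
import json
from itertools import count


class CaptureSession:
    """Hypothetical helper that tracks CDP Network events for one target."""

    def __init__(self):
        self._ids = count(1)   # each command carries a client-chosen integer id
        self.responses = {}    # requestId -> response metadata reported by the browser

    def command(self, method, params=None):
        """Serialize one CDP command as JSON text."""
        return json.dumps({"id": next(self._ids), "method": method, "params": params or {}})

    def handle_event(self, event):
        """Consume one CDP event; return follow-up commands to send, if any."""
        method, params = event["method"], event["params"]
        if method == "Network.responseReceived":
            self.responses[params["requestId"]] = params["response"]
        elif method == "Network.loadingFinished" and params["requestId"] in self.responses:
            # Ask for the body only once the browser reports loading has finished.
            return [self.command("Network.getResponseBody", {"requestId": params["requestId"]})]
        return []


session = CaptureSession()
enable = session.command("Network.enable")

# Made-up events of the shape the browser would emit for a single page load.
events = [
    {"method": "Network.responseReceived",
     "params": {"requestId": "1000.1",
                "response": {"url": "https://example.com/", "status": 200}}},
    {"method": "Network.loadingFinished", "params": {"requestId": "1000.1"}},
]
follow_ups = [cmd for ev in events for cmd in session.handle_event(ev)]
```

The `id` field is how a command is framed when sent over a DevTools WebSocket; inside an extension, `chrome.debugger.sendCommand` takes the method name and params directly.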

machawk1 avatar machawk1 commented on August 22, 2024

@hanoii I agree. The tool one uses to view the Web ought to be the same tool that is used to archive it. Daily-driver browsers (e.g., Chrome and Firefox) do not natively support writing or reading WARCs.

Tools exist that leverage a separate browser/tool to generate a WARC, but needing to switch tools to archive is not ideal and a reason WARCreate exists -- to allow the same tools one uses daily (here, Chrome) to also archive the Web. Having an ad hoc fork of Chromium with WARC support, while interesting, suffers from the same ad hoc problem.

The good news is that since WARCreate's inception (~2011 😅), some tools are moving toward leveraging your regular browser experience for preservation. For example, Squidwarc is working toward using your browser's own cookies to provide an additional rich result of captures behind authentication.

Thanks for the feedback and interest. I welcome further discussion.

hanoii avatar hanoii commented on August 22, 2024

@machawk1 thanks for the reply.

For me, having a separate tool is not that much of a problem. It is actually a benefit: it compartmentalizes the archival work in its own process, it helps with privacy because you can log in there with different credentials, and it allows for add-on tooling that can independently keep up with new technologies and sites' anti-crawling techniques. But I agree that if it can all live in the same tool, it's interesting.

Forking Chrome is still something I am just toying with, although I can see it not being easy work.

The good thing about having your own tool is that you can do a lot more than just create the WARC file: screenshots, annotations, and even access to external tools (youtube-dl) become possible, whereas an extension is always limited to what the host allows.

I still think this is a great approach and I am looking forward to trying the new version, as you mentioned in #111 that the Chrome store has a stale one.

Also, I need installing and using it not to be too hard, as this is for a student audience doing research/investigation on a specific topic.

Squidwarc and warcprox are also interesting approaches - I haven't played a lot with squidwarc yet.

Right now I am also exploring Electron as a medium-term alternative. It's Chromium, you don't need to build it, you can run native applications, and you can access Chromium APIs like chrome.webRequest. Still, it has its issues that you have to handle manually.

I would appreciate your thoughts as well. I need to recommend/estimate different options for something that is going to be funded and later open-sourced, so understanding the different concepts and the problems you had to overcome is super useful.

machawk1 avatar machawk1 commented on August 22, 2024

@hanoii There has been some discussion relatively recently on approaches toward preserving the Web. I think WARCreate has some merit in ease of installation (click a button in the Chrome store) and usage (a single button to generate a WARC), at the expense of it having been novel at a time when no software libraries existed on which to build WARC files and the browser APIs to do so were inadequate.

With that said, it is an approach (reusing the user's browser) where other WARC-generation tools have their own. For example:

  • Squidwarc leverages a browser internally for rich crawling results (instead of a single page) and (unconfirmed) can reuse cookies.
  • Webrecorder allows WARCs to be generated while the user browses
  • Brozzler takes a more distributed crawling approach, leverages a browser and external tools (e.g., youtube-dl, as you mentioned)

As another data point, WAIL (Electron) is an Electron app that attempts to bundle browser-based crawling into a native interface. For disclosure, I was the creator of the original WAIL application, which I developed to work around a shortcoming of browser APIs at the time for WARCreate (more info), but the re-imagining is the handiwork of @N0taN3rd, who is now an employee of @webrecorder and the creator of Squidwarc.

hanoii avatar hanoii commented on August 22, 2024

@machawk1 I saw all of them. Webrecorder didn't play that nicely with Facebook unfortunately, which is the site I am testing most tools with, as it's the one we are most interested in archiving and likely very complex.

I saw WAIL; it's based on an older stack of both Electron and pywb, but I might certainly get to look at it. I didn't expect the tools to work right out of the box, so I am also trying to choose the tool I could contribute to most. The one thing I like about warcreate is that its codebase is manageable, and all JavaScript. Still, I believe a bit more flexibility is needed around not just being able to get a WARC archive but maybe other stuff.

How would warcreate behave with streaming media?

machawk1 avatar machawk1 commented on August 22, 2024

not just being able to get a WARC archive but maybe other stuff.

What sort of other stuff? Some browsers also natively support the HAR format and @ikreymer even created a library, har2warc, to convert HAR to WARC, so there may be potential there with regard to preservation.
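
To make the HAR-to-WARC idea concrete, here is a minimal hand-rolled sketch of turning one HAR entry into a WARC/1.0 `response` record. It is not har2warc's API; the function name and sample entry are made up for illustration. It also glosses over details a real converter has to handle: HAR bodies are already decoded, so headers such as `Content-Encoding` may no longer match the body, and `startedDateTime` is copied as-is rather than normalized to the UTC format that WARC expects.

```python
import base64
import uuid


def har_entry_to_warc_response(entry):
    """Build one WARC/1.0 'response' record (as bytes) from a HAR entry dict."""
    response = entry["response"]
    content = response.get("content", {})
    text = content.get("text", "")
    # HAR flags base64-encoded (typically binary) bodies with an "encoding" field.
    if content.get("encoding") == "base64":
        body = base64.b64decode(text)
    else:
        body = text.encode("utf-8")

    # Rebuild the HTTP response message that forms the WARC record block.
    status_line = (f"{response.get('httpVersion', 'HTTP/1.1')} "
                   f"{response['status']} {response.get('statusText', '')}")
    header_lines = [f"{h['name']}: {h['value']}" for h in response.get("headers", [])]
    http_block = "\r\n".join([status_line, *header_lines, "", ""]).encode("utf-8") + body

    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {entry['startedDateTime']}",  # simplification: not normalized
        f"WARC-Target-URI: {entry['request']['url']}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_block)}",
    ]
    # A record is: header lines, a blank line, the block, then two CRLFs.
    return "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n" + http_block + b"\r\n\r\n"


entry = {
    "startedDateTime": "2024-08-22T12:00:00Z",
    "request": {"url": "https://example.com/"},
    "response": {
        "httpVersion": "HTTP/1.1", "status": 200, "statusText": "OK",
        "headers": [{"name": "Content-Type", "value": "text/plain"}],
        "content": {"text": "hello"},
    },
}
record = har_entry_to_warc_response(entry)
```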

@N0taN3rd is more than aware of WAIL-Electron using older versions of Electron and pywb. I continually encourage him to keep developing it despite his new affiliation. I am hoping the pings in this thread will serve as reassurance to the continued need of an app like his. ;-)

I have not done extensive validation of WARCreate with regard to streaming media in a while, so I am unsure. Some more testing is in order, and while I have appreciated user feedback in the past, I have been unable to attract development cycles from others despite the codebase being all-JS.

I am unsure if this is because of the nature of the audience or the quality of the project being a detractor. For example, many users that want a simple non-technical solution may not have coding experience. On the flip side, those who can code may not contribute because they are searching for more technical solutions.

Any suggestions you have on making the tool more useful and functional from a technical perspective would be appreciated. A lot of the feedback has been high level, which is useful for the conveyed use case, but generally does not improve the software overall.

hanoii avatar hanoii commented on August 22, 2024

@machawk1 Is there anything you can share about the path you took building this tool, in terms of big pitfalls that you found, any unorthodox thing you might have done in order to sort out those pitfalls, or the like?

It seems to me that https://developer.chrome.com/extensions/webRequest is a lot of what you need, but did you have to rely heavily on other APIs? I will definitely go through the code in more depth, but it's always helpful to have the overall approach/difficulties in mind while going about things.

I am making good progress on the electron side of things. It's great how it has progressed.

machawk1 avatar machawk1 commented on August 22, 2024

@hanoii The capabilities and scope of Chrome extensions have come a long way since I originally created WARCreate. There was no webRequest API initially, writing files outside of the browser file sandbox was impossible, and the WebExtensions standard did not exist (Firefox was still using XUL-based add-ons).

Then webRequest was introduced as an experimental API and eventually became accessible outside of Canary. One big issue that webRequest mitigated (and there may be a more elegant way to accomplish it now) was reading the raw stream/bytes as they came "over the wire". This would have made caching these bytes for writing a lot easier but was not possible at the time. I believe something within the debugging/console API may make the process even easier than using webRequest.

The other issue was breaking out of the sandbox for writing. Per the blog post I linked before, accomplishing this initially required a "local server", which was unacceptable for a solution. There was no HTML File API then, but eventually some libraries made this process possible as the standards made their way to the browsers.

I look forward to seeing what you come up with using Electron.

machawk1 avatar machawk1 commented on August 22, 2024

@N0taN3rd Thanks for chiming in here. :)

Can you provide some insight into other (the alternative) ways to do it from the browser? That could help guide other potential solutions, and you are knowledgeable enough about all things JS that your pointers could help ensure the browser's capabilities (re: extensions) are fully utilized.

ikreymer avatar ikreymer commented on August 22, 2024

There are a few options from the perspective of a browser. I like the HAR approach, as the browser just gives that to you directly; the only downside is that you have to have DevTools open.

@hanoii As mentioned in the other issue, Facebook is particularly complicated and requires custom tweaking. We will take a look at tweaking it for the time being, but no guarantee that it won't break again in the future.

I need to keep, if possible, better consumable media. The WARC part is probably the default for the archiving-to-replay side of things, but it might make sense to have screenshots, annotations, tagging, and then maybe storing it somewhere. Attempting to fetch YouTube and/or Facebook videos through youtube-dl could come in handy, so I am exploring the option to do that from an Electron app, at least for now as an initial PoC tool.

A lot of this is what Webrecorder is also trying to support. The issue with Facebook is not the capture process, but usually the replay/reproducibility of dynamic content that changes on each load. We have been working on this for 5+ years, and Facebook remains difficult, as mentioned in the other issue Rhizome-Conifer/conifer#664

We are also considering options for a desktop/Electron WR that is not just a player but can also do capture, but our resources are limited. If this is something you'd be interested in helping out with, let's chat :)

ikreymer avatar ikreymer commented on August 22, 2024

More specifically, the issue with FB is that we need to 'fuzzy match' requests to responses, and the rules for how that works keep changing (because Facebook changes their API).

A custom browser solution for capture will not really help with any of that; it needs to be done at request/response lookup time. Here are some (slightly old) docs on how this system works: https://github.com/webrecorder/pywb/wiki/Fuzzy-Match-Rules
Mostly, this system hasn't changed much and we definitely need to add it to the latest docs!
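
For readers unfamiliar with the idea, here is a toy sketch of fuzzy matching at lookup time. It does not use pywb's rule syntax or code; the parameter names in `VOLATILE_PARAMS` and all function names are assumptions chosen for illustration. A replayed request whose URL differs from the captured one only in per-load query parameters is mapped to the same lookup key.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical rule set: query parameters assumed to change on every page load.
VOLATILE_PARAMS = {"_", "cb", "timestamp"}


def fuzzy_key(url, volatile=VOLATILE_PARAMS):
    """Reduce a URL to a lookup key: drop volatile params and sort the rest."""
    parts = urlsplit(url)
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in volatile
    )
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))


def build_index(captured):
    """Index captured {url: response} pairs under both exact and fuzzy keys."""
    index = {"exact": dict(captured), "fuzzy": {}}
    for url, response in captured.items():
        index["fuzzy"].setdefault(fuzzy_key(url), response)
    return index


def lookup(index, url):
    """Try an exact match first, then fall back to the fuzzy key."""
    if url in index["exact"]:
        return index["exact"][url]
    return index["fuzzy"].get(fuzzy_key(url))


# Captured with one cache-busting value, replayed with another.
index = build_index({"https://example.com/api/feed?user=7&_=1111": "feed-response"})
replayed = lookup(index, "https://example.com/api/feed?_=2222&user=7")
```

The hard part described above is not this mechanism but keeping the rule set current as a site changes which parameters matter.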
