
Working status, how does it work? (warcreate issue, open)

hanoii commented on August 22, 2024
Working status, how does it work?

from warcreate.

Comments (9)

hanoii commented on August 22, 2024

EDIT: also related to #112.


machawk1 commented on August 22, 2024

Hi @hanoii, I just pushed a new version of WARCreate because the Chrome Web Store reported a compliance issue. This has happened before; the check seems automated and prone to false positives. Given that the previous version was from 2017, this version should have some improvements in capture quality.

To answer your questions (I hope):
WARCreate is currently site-agnostic. It does not capture sites but captures pages. These pages may be content behind authentication, whose payloads are stored similarly to "surface web" content.

What do you mean that pywb/webrecorder-player gives you a "record"?

The primary use case is that while browsing the Web, you should be able to click the icon, click the "generate WARC" button and, potentially after a short delay to amalgamate the resource representations, have a WARC downloaded to your local file system.

WARCreate uses an anticipatory model, collecting the representations in a cache in your browser until you browse to another page, at which point the cache is cleared and re-generated for the current page. If you choose to generate a WARC of the page, this cache is partially used as the basis for WARC creation. Some representations, like the current HTML page, are captured at the time the button is pressed, as a cached version would likely be stale if the DOM was manipulated between page load and button push.
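To make the anticipatory model above concrete, here is a minimal TypeScript sketch. It is not WARCreate's actual code: the class, interface, and method names are all hypothetical, and it assumes the extension is told about navigations, observed responses, and the live HTML by some outer glue code that is not shown.

```typescript
// Hypothetical sketch of an anticipatory per-tab cache; not WARCreate's real implementation.
interface CachedResponse {
  url: string;
  status: number;
  headers: Record<string, string>;
  body: string | null; // null when only the headers were observed
}

class AnticipatoryCache {
  private byTab = new Map<number, Map<string, CachedResponse>>();

  // The tab moved to another page: drop everything collected for the old one.
  onNavigate(tabId: number): void {
    this.byTab.set(tabId, new Map<string, CachedResponse>());
  }

  // A response was observed while the current page was loading.
  record(tabId: number, response: CachedResponse): void {
    let entries = this.byTab.get(tabId);
    if (!entries) {
      entries = new Map<string, CachedResponse>();
      this.byTab.set(tabId, entries);
    }
    entries.set(response.url, response);
  }

  // The user pressed the button: reuse cached entries, but take the page's
  // HTML as it is right now, since the cached copy may be stale.
  snapshot(tabId: number, pageUrl: string, liveHtml: string): CachedResponse[] {
    const entries = new Map<string, CachedResponse>();
    const cached = this.byTab.get(tabId);
    if (cached) {
      cached.forEach((value, key) => entries.set(key, value));
    }
    const original = entries.get(pageUrl);
    entries.set(pageUrl, {
      url: pageUrl,
      status: original ? original.status : 200, // assumption when nothing was cached
      headers: original ? original.headers : {},
      body: liveHtml,
    });
    return Array.from(entries.values());
  }
}
```

The snapshot copies the cache rather than mutating it, so pressing the button twice on the same page would start from the same observed responses each time.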


hanoii commented on August 22, 2024

@machawk1 me again. I have been exploring a different approach with electron and I was happy with where it was heading, but it seems they are worried about their end users not wanting to use a separate browser, so I am now more focused on a chrome extension approach, or maybe some kind of communication between a chrome extension and an electron app.

Anyway, I went to give this another spin, but I am still failing to get a simple WARC out of it (I mean one that actually renders in webrecorder player, for instance).

I tried drupal.org unsuccessfully. I have the following version installed (tried removing and re-adding).

[Screenshot of the installed version, taken 2019-02-05 at 16:10:14]

Am I missing something?


machawk1 commented on August 22, 2024

@hanoii I was able to generate a WARC from drupal.org but it is somewhat problematic with respect to replay in pywb and webrecorder player. It is likely an issue with strict validation of the file being produced from WARCreate. I will need to look into it.
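For readers wondering what strict validation tends to trip on: a WARC/1.0 record is a block of CRLF-terminated header lines, a blank line, the content block, and then two CRLFs, with Content-Length giving the block size in bytes (not characters). The sketch below shows one way to serialize a response record under those rules. Whether any of this is the actual cause of the replay problem with WARCreate's output is not known from this thread, and the function and parameter names are illustrative only.

```typescript
// Illustrative sketch: serialize one WARC/1.0 "response" record.
// Names are hypothetical; the caller supplies the raw HTTP message and a UUID.
function buildWarcResponseRecord(
  targetUri: string,
  httpMessage: Uint8Array, // raw HTTP status line + headers + body
  recordUuid: string,
  date: Date
): Uint8Array {
  const encoder = new TextEncoder();
  // WARC-Date uses second precision, so drop the milliseconds from the ISO string.
  const warcDate = date.toISOString().replace(/\.\d{3}Z$/, "Z");
  const header =
    "WARC/1.0\r\n" +
    "WARC-Type: response\r\n" +
    `WARC-Target-URI: ${targetUri}\r\n` +
    `WARC-Date: ${warcDate}\r\n` +
    `WARC-Record-ID: <urn:uuid:${recordUuid}>\r\n` +
    "Content-Type: application/http; msgtype=response\r\n" +
    // Byte length of the block, which differs from string length for non-ASCII content.
    `Content-Length: ${httpMessage.byteLength}\r\n` +
    "\r\n";
  const headerBytes = encoder.encode(header);
  const trailer = encoder.encode("\r\n\r\n"); // every record ends with two CRLFs
  const record = new Uint8Array(headerBytes.length + httpMessage.length + trailer.length);
  record.set(headerBytes, 0);
  record.set(httpMessage, headerBytes.length);
  record.set(trailer, headerBytes.length + httpMessage.length);
  return record;
}
```

Keeping the payload as bytes end to end avoids the common mistake of measuring Content-Length on a decoded string.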

The approach of using a browser extension and the user's own browser is novel to WARCreate and rightfully so -- it's a tough task, especially when no WARC libraries existed (as was the case when WARCreate was written), to cache and save everything accessible from the browser API (and not over the wire) to the local system.

I mentioned WAIL (Electron) in #112. This is the port to Electron of a native app written in Python that I originally wrote to mitigate some barriers in WARCreate. Namely, WARCreate would communicate with WAIL directly. Your desire to have an Electron program in the loop is somewhat reminiscent of this and feasible, but both parties (the extension and the Electron app) should be receptive to the process. Instead of an ad hoc approach (which is far, far easier), I was hoping to eventually utilize the WASAPI API for WARC "transfer". This would make the tools a bit more interoperable.


hanoii commented on August 22, 2024

Are those limitations still there as far as you can tell, or, if you were to rewrite some parts of it, do you think there are better alternatives? I have yet to look in depth at your code, but I see you cannot access raw data unless you use the network devtools extension API. That is a devtools-only extension, which could also be an option to consider.

I spoke with the webrecorder guys. I successfully used https://github.com/N0taN3rd/node-warc with a simple custom browser on electron to render drupal.org and it worked quite nicely, but now using a chrome extension is almost mandatory.

I was able to communicate easily between a chrome extension and an electron app, and now this other app could be just a node app rather than electron, which makes it easier. I would still like to do the WARC generation from the browser, for the exact same reason you state on your project page.

If we go this route I will probably dive much deeper on this extension.

I also tried facebook (knowing it's a hard site) and I got no WARC at all from it. I am not sure if I have to wait a lot longer, but I waited quite a bit.


hanoii commented on August 22, 2024

@machawk1 on top of the questions above, I was looking a bit more into the code this morning. I might need to go over it in more depth, but I see you are re-fetching css/js/images so that you can get their data, correct? Is that the cache you mentioned? So you are storing request information but mostly refetching everything you don't have the data for?

How would XHR or AJAX requests be handled? Would you also refetch those?

And I've been looking, and it seems there's still no other way of getting the raw data of the request unless you do it externally through CDP or within a chrome devtools extension. Right?


machawk1 commented on August 22, 2024

@hanoii webRequest allows WARCreate to read some headers and payload when they come over the wire. I believe due to some synchronicity issues, there was a need (as implemented) to refetch some resources based on the payloads "missing" as analyzed when the WARC is being created.
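The refetch step described above could be sketched roughly as follows: at WARC creation time, compare the requests that were observed against the payloads that were actually captured, and list what is missing. This is only an illustration with hypothetical names, not WARCreate's code. Limiting refetches to GET requests is an assumption of this sketch (replaying a POST could have side effects on the server), not something stated in the thread.

```typescript
// Hypothetical sketch: decide which observed requests need a refetch.
interface ObservedRequest {
  url: string;
  method: string;
}

interface RefetchPlan {
  refetch: string[]; // GET URLs whose payload was never captured
  skipped: string[]; // non-GET URLs without a payload; not safe to replay blindly
}

function planRefetch(
  observed: ObservedRequest[],
  payloads: Map<string, Uint8Array> // payloads captured so far, keyed by URL
): RefetchPlan {
  const plan: RefetchPlan = { refetch: [], skipped: [] };
  const seen = new Set<string>();
  for (const request of observed) {
    const method = request.method.toUpperCase();
    const key = method + " " + request.url;
    // Nothing to do if the payload is already there or the request was already planned.
    if (payloads.has(request.url) || seen.has(key)) continue;
    seen.add(key);
    if (method === "GET") {
      plan.refetch.push(request.url);
    } else {
      plan.skipped.push(request.url);
    }
  }
  return plan;
}
```

The skipped list is kept separate so that the gaps a refetch cannot fill, which is where XHR-driven pages are likely to suffer, are at least visible.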

I should note that WARCreate should maintain a privileged trait with regard to AJAX and CORS. Normally, fetching resources in this way would cause the request to be rejected from the server hosting the resource. Also note that at one point we tried moving from XHR to Fetch but the latter was limited in what headers could be read from the response, so that information would be unavailable to be included in the WARC. Hence, you will see a bit of XHR in the code instead of the modern alternative.

Using devtools would make the job easier. The API did not exist when I initially created WARCreate but after it was introduced, with a cursory analysis and the advice of @N0taN3rd, we found that it needed to be "open" to be accessible. I am unsure if this is still the case but if not, would be open to explore using devtools if it gives a more comprehensive ability to capture what's coming over the wire. Having the raw data would be ideal but WARCreate currently attempts to account for the inability to do so at the time.

Another good thing to have would be a means of evaluating the extension. I have had reports of "it does not work on site X" but the reason is rarely distilled to be debuggable. Some sample, hosted Web pages that isolate a problematic feature would help to be able to isolate shortcomings. Having these hosted in a predictable environment (e.g., a test suite consisting of fundamental features on GitHub pages) would be helpful.
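As a rough idea of what such an evaluation could look like: each hosted test page isolates one feature and declares the URLs a complete capture should contain, and a check compares that list against the target URIs found in the generated WARC. The sketch below assumes the list of captured URLs has already been extracted from the WARC by some other means; all names and the shape of the test case are hypothetical.

```typescript
// Hypothetical sketch of a capture check for one isolated feature page.
interface FeatureCase {
  name: string;           // e.g. "xhr-json"
  pageUrl: string;        // the hosted test page
  expectedUrls: string[]; // every URL a complete capture should contain
}

interface CaseResult {
  name: string;
  passed: boolean;
  missing: string[];
}

function evaluateCapture(testCase: FeatureCase, capturedUrls: string[]): CaseResult {
  const captured = new Set<string>(capturedUrls);
  // Anything expected but absent from the WARC points at the feature that failed.
  const missing = testCase.expectedUrls.filter((url) => !captured.has(url));
  return { name: testCase.name, passed: missing.length === 0, missing };
}
```

A report of "it does not work on site X" then reduces to the names of the failing cases and the specific URLs that were not captured.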


hanoii commented on August 22, 2024

> I should note that WARCreate should maintain a privileged trait with regard to AJAX and CORS. Normally, fetching resources in this way would cause the request to be rejected from the server hosting the resource. Also note that at one point we tried moving from XHR to Fetch but the latter was limited in what headers could be read from the response, so that information would be unavailable to be included in the WARC. Hence, you will see a bit of XHR in the code instead of the modern alternative.

I am not sure I understood this. I was wondering what you do with XHR or AJAX requests to store that data in the WARC: are those also re-fetched, or is that payload actually available from the webRequest API?

I recently worked on this site: www.moogmusic.com. It's an angular site with a REST interface, so although it is probably complex to capture, it's predictable in the sense that there are no random query strings appended or anything. A quick try with the extension also doesn't work when replaying afterwards in webrecorder player, and looking at the devtools of the player I see missing requests for the REST resource.

> Using devtools would make the job easier. The API did not exist when I initially created WARCreate but after it was introduced, with a cursory analysis and the advice of @N0taN3rd, we found that it needed to be "open" to be accessible. I am unsure if this is still the case but if not, would be open to explore using devtools if it gives a more comprehensive ability to capture what's coming over the wire. Having the raw data would be ideal but WARCreate currently attempts to account for the inability to do so at the time.

I believe it still has to be open, but it could potentially be a good compromise if it really helps.

> Another good thing to have would be a means of evaluating the extension. I have had reports of "it does not work on site X" but the reason is rarely distilled to be debuggable. Some sample, hosted Web pages that isolate a problematic feature would help to be able to isolate shortcomings. Having these hosted in a predictable environment (e.g., a test suite consisting of fundamental features on GitHub pages) would be helpful.

Have you had any success in capturing twitter/facebook? I know now that replaying what's captured on those sites is complex on its own.

I don't know yet how to distill an issue from a site unless I debug the extension extensively. But as mentioned above, I also tried lanacion.com.ar and that one didn't replay properly either.

But if I find something concrete I will be sure to let you know.

The one thing I am mostly worried about is the real feasibility of creating a WARC from a regular extension. You seem to have done a great job overcoming some of the issues, but I wonder if others are simply not possible to solve within the extension. Is your current feeling that there should be a way to create a fully valid WARC out of every interaction between the site and the server from a regular extension?


payingattention commented on August 22, 2024

> I couldn't open it though with https://github.com/webrecorder/webrecorder-player or https://github.com/webrecorder/pywb, it gives a record.

I can confirm that when saving via "Uploading [WARC] to Collection" at https://conifer.rhizome.org/037 (formerly https://WebRecorder.io/037, the name changed), your WARC file only gets to 50% and then says "Error Encountered".

[Screenshot: Rhizome-Conifer accounts page (web archive collection for user @037)]

The same https://twitter.com/prosodyContext page saved with the https://WebRecorder.net https://github.com/webrecorder/webrecorder-desktop ".AppImage" binary works (I do not mean to make you compete, multiple tools is important). I can provide the WARCreate WARC archive(s) in question if you ask.

Should I make a separate support issue ticket or is here appropriate/good?

