
extract entire warc file? (warcio issue, closed)

catharsis71 commented on June 20, 2024
extract entire warc file?

from warcio.

Comments (4)

wumpus commented on June 20, 2024

Are you trying to run code on each webpage in the file?

Or do you want to leave all of the webpages on disk somehow?

The first way is what you probably should do. That's just a loop like the one in the documentation:

https://github.com/webrecorder/warcio#warc-and-arc-streaming
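
For reference, a loop of that kind might look like the minimal sketch below. It assumes warcio's `ArchiveIterator` and the `rec_type` and `rec_headers.get_header` record attributes shown in the linked documentation; the function names `iter_response_urls` and `print_response_urls` are made up for illustration.

```python
def iter_response_urls(records):
    """Yield the target URI of every 'response' record in an iterable of WARC records."""
    for record in records:
        # Skip warcinfo, request, metadata and other non-response records.
        if record.rec_type == "response":
            yield record.rec_headers.get_header("WARC-Target-URI")


def print_response_urls(warc_path):
    """Stream a WARC file from disk and print the URL of each captured response."""
    # Imported here so the filter above stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for url in iter_response_urls(ArchiveIterator(stream)):
            print(url)
```

Whatever per-page code you want to run would go where the `print` call is. The records are read one at a time, so the whole file never has to fit in memory.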

The second thing, leaving all of the webpages on disk as separate files, is difficult to do, because how will you name the files? And what happens if the WARC file has two webpages with exactly the same URL?
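
To make the naming problem concrete, here is one possible way to map URLs to file paths. This is only a sketch under assumptions that are not from this thread: the helper name `url_to_relpath`, the `index.html` default for directory URLs (borrowed from wget's convention), percent-encoding the query string into the filename, and appending `.1`, `.2`, and so on when the same URL shows up again.

```python
from urllib.parse import quote, urlsplit


def url_to_relpath(url, seen):
    """Map a URL to a relative file path; `seen` is a dict tracking paths already used."""
    parts = urlsplit(url)
    # Drop empty, "." and ".." segments so a hostile URL cannot escape the output directory.
    segments = [s for s in parts.path.split("/") if s not in ("", ".", "..")]
    # A URL that names a directory has no filename of its own (assumed default: index.html).
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")
    # Keep the query string in the filename so /page?id=1 and /page?id=2 stay distinct.
    if parts.query:
        segments[-1] += "?" + parts.query
    # Percent-encode every segment; "/" is only ever used as the directory separator.
    rel = "/".join(quote(s, safe="") for s in [parts.netloc] + segments if s)
    # Second and later captures of the same URL get a numeric suffix.
    count = seen.get(rel, 0)
    seen[rel] = count + 1
    return rel if count == 0 else f"{rel}.{count}"
```

Even this leaves open questions, such as a URL that is both a file and a directory (`/a` and `/a/b`), or a suffixed name colliding with a real URL that ends in `.1`.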


catharsis71 commented on June 20, 2024

Yes, I'm trying to unpack the entire contents to disk, similar to warcat's "extract" option, preserving the directory structure and filenames.

Unfortunately, warcat apparently has a bug: if it encounters an HTTP header in the WARC that it doesn't like, it bombs out with no option to ignore it and continue.

So I'm exploring other options, but it seems like there's not really anything that does the job.


wumpus commented on June 20, 2024

I suggest that you answer the questions I asked, and then I will implement it for you. If the WARC is from wget for a website that has no CGI-argument weirdness, then it's not so bad, but I would like you to be clear about whether this is sufficient.


wumpus commented on June 20, 2024

@catharsis71 ?

