Giter Site home page Giter Site logo

warcprotocol's Introduction

Code Coverage Nuget

Toimik.WarcProtocol

.NET 6 C# WARC parser and writer.

Features

  • Parses uncompressed / compressed - as a whole or per record - Web ARChive (WARC) version 1.0 / 1.1 files (via WarcParser class)
  • Option to terminate or resume processing upon encountering malformed content
  • Creates WARC records with auto-generated WARC-Block-Digest / WARC-Payload-Digest using configurable hash algorithm
  • Writes uncompressed or per-record compressed WARC version 1.0 / 1.1 files (via WarcWriter class)

Quick Start

Installation

Package Manager

PM> Install-Package Toimik.WarcProtocol

.NET CLI

> dotnet add package Toimik.WarcProtocol

WarcParser Usage

The WarcParser class can be used to parse WARC 1.0 and 1.1 files.

using System.Diagnostics;
using System.Threading.Tasks;
using Toimik.WarcProtocol;

class Program
{
    static async Task Main(string[] args)
    {
        var parser = new WarcParser();

        // Path to a WARC file that contains one or more records, which may be uncompressed or
        // compressed (as a whole or per record)
        var path = "uncompressed.warc";

        // Use a '.gz' file extension to indicate that the file is compressed using GZip
        // var path "compressed.warc.gz";

        // In case any part of the WARC file is malformed and parsing is expected to resume -
        // rather than throwing an exception and terminates - a parse log is specified
        var parseLog = new DebugParseLog();

        // Parse the file and process the records accordingly
        var records = parser.Parse(path, parseLog);

        // Alternatively, use a stream:
        // var stream = ...
        // var isCompressed = ... // true if the stream is compressed; false otherwise
        // var records = parser.Parse(stream, isCompressed, parseLog);

        await foreach (WarcProtocol.Record record in records)
        {
            switch (record.Type)
            {
                case ContinuationRecord.TypeName:

                    // ...
                    break;

                case ConversionRecord.TypeName:

                    // ...
                    break;

                case MetadataRecord.TypeName:

                    // ...
                    break;

                case RequestRecord.TypeName:

                    // ...
                    break;

                case ResourceRecord.TypeName:

                    // ...
                    break;

                case ResponseRecord.TypeName:

                    // ...
                    break;

                case RevisitRecord.TypeName:

                    // ...
                    break;

                case WarcinfoRecord.TypeName:

                    // ...
                    break;

                case "custom-type":

                    // ...
                    break;
            }
        }
    }

    class DebugParseLog : IParseLog
    {
        public void ChunkSkipped(string chunk)
        {
            Debug.WriteLine(chunk);
        }

        public void ErrorEncountered(string error)
        {
            Debug.WriteLine(error);
        }
    }
}

WarcWriter Usage

The WarcWriter class can be used to write WARC records to a file.

using Toimik.WarcProtocol;

class Program
{
    static void Main(string[] args)
    {
        // Creates a new WARC file with per-record compression
        // Wrap the writer in a "using" block to ensure data
        // is properly flushed to the file.
        // Or call "Close()" directly
        using (WarcWriter warcWriter = new WarcWriter("example.warc.gz"))
        {
            warcWriter.Write(warcInfoRecord);
            warcWriter.Write(requestRecord);
            warcWriter.Write(responseRecord);
        }
        
        // You can also create uncompressed WARCs. This is controlled via the file extension.
        using (WarcWriter writer = new WarcWriter("uncompressed.warc"))
        {
            // ...
        }
        
        // You can also force per-record compression, regardless of file extension
        using (WarcWriter writer = new WarcWriter("actually-compressed.warc.whatever", true))
        {
            // ...
        }
    }
}

warcprotocol's People

Contributors

nurhafiz avatar acidus99 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.