warc-parquet's Introduction

warc-parquet

🗄️ A utility for converting WARC to Parquet.

📦 Install

The binary may be installed via cargo:

$ cargo install warc-parquet

To use the crate in your project, add the following to your Cargo.toml file:

[dependencies]
warc-parquet = "0.6.1"

🤸 Usage

The Binary

Once installed, the warc-parquet utility can be used to transform WARC into Parquet:

$ wget --warc-file example 'https://example.com'
$ cat example.warc.gz | warc-parquet --gzipped > example.zstd.parquet

warc-parquet is meant to fit organically into the UNIX ecosystem. As such, processing multiple WARCs at once is straightforward:

$ wget --warc-file github 'https://github.com'
$ cat example.warc.gz github.warc.gz | warc-parquet --gzipped > combined.zstd.parquet

It's also simple to preprocess via standard UNIX piping:

$ cat example.warc.gz | gzip -d | warc-parquet > example.zstd.parquet

Various compression options, including the option to forgo compression altogether, are also available:

$ cat example.warc.gz | warc-parquet --gzipped --compression gzip > example.gz.parquet
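The same pipelines can also be driven from a Rust program instead of a shell. The following is a minimal sketch using only the standard library, assuming the warc-parquet binary is on your PATH. The helper names build_command and convert are hypothetical and are not part of the crate; only the --gzipped and --compression flags shown above are used.

```rust
use std::fs::File;
use std::io;
use std::process::{Command, ExitStatus, Stdio};

/// Builds a `warc-parquet` invocation without running it.
/// `build_command` is a hypothetical helper, not part of the crate.
fn build_command(gzipped: bool, compression: Option<&str>) -> Command {
    let mut cmd = Command::new("warc-parquet");
    if gzipped {
        // Tell warc-parquet that its stdin is gzip-compressed.
        cmd.arg("--gzipped");
    }
    if let Some(codec) = compression {
        // For example "gzip", as in the shell example above.
        cmd.arg("--compression").arg(codec);
    }
    cmd
}

/// Rough equivalent of `cat INPUT | warc-parquet [flags] > OUTPUT`.
fn convert(
    input: &str,
    output: &str,
    gzipped: bool,
    compression: Option<&str>,
) -> io::Result<ExitStatus> {
    let stdin = File::open(input)?;
    let stdout = File::create(output)?;
    build_command(gzipped, compression)
        .stdin(Stdio::from(stdin))
        .stdout(Stdio::from(stdout))
        .status()
}

fn main() -> io::Result<()> {
    let status = convert("example.warc.gz", "example.gz.parquet", true, Some("gzip"))?;
    println!("warc-parquet exited with {status}");
    Ok(())
}
```

Wiring the files directly to the child's stdin and stdout keeps the data out of the parent process, which mirrors what the shell redirections do.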

💡 warc-parquet --help displays complete options and usage information.

The Crate

Refer to the docs for more details about how to use the Reader within your own programs.

DuckDB

There are any number of ways to consume Parquet once you have it. However, a natural fit might be DuckDB:

$ duckdb
v0.3.3 fe9ba8003
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

D select type, id from 'example.zstd.parquet';
┌──────────┬─────────────────────────────────────────────────┐
│   type   │                       id                        │
├──────────┼─────────────────────────────────────────────────┤
│ warcinfo │ <urn:uuid:A8063499-7675-4D8D-A736-A1D7DAE84C84> │
│ request  │ <urn:uuid:3EB20966-D74F-4949-AACB-23DB3A0733A7> │
│ response │ <urn:uuid:8B92CADC-F770-45BE-8B72-E13A61CD6D1C> │
│ metadata │ <urn:uuid:4C0E9E17-E21B-49E0-859A-D1016FBDE636> │
│ resource │ <urn:uuid:14F502A5-3BDE-4D0B-8A43-95F4BB8398C6> │
│ resource │ <urn:uuid:6B6D6ADD-52FF-4760-AA00-FB9E739CABBE> │
└──────────┴─────────────────────────────────────────────────┘

D describe select * from 'example.zstd.parquet';
┌─────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│       column_name       │ column_type │ null │ key │ default │ extra │
├─────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ id                      │ VARCHAR     │ YES  │     │         │       │
│ content_length          │ UINTEGER    │ YES  │     │         │       │
│ date                    │ TIMESTAMP   │ YES  │     │         │       │
│ type                    │ VARCHAR     │ YES  │     │         │       │
│ content_type            │ VARCHAR     │ YES  │     │         │       │
│ concurrent_to           │ VARCHAR     │ YES  │     │         │       │
│ block_digest            │ VARCHAR     │ YES  │     │         │       │
│ payload_digest          │ VARCHAR     │ YES  │     │         │       │
│ ip_address              │ VARCHAR     │ YES  │     │         │       │
│ refers_to               │ VARCHAR     │ YES  │     │         │       │
│ target_uri              │ VARCHAR     │ YES  │     │         │       │
│ truncated               │ VARCHAR     │ YES  │     │         │       │
│ warc_info_id            │ VARCHAR     │ YES  │     │         │       │
│ filename                │ VARCHAR     │ YES  │     │         │       │
│ profile                 │ VARCHAR     │ YES  │     │         │       │
│ identified_payload_type │ VARCHAR     │ YES  │     │         │       │
│ segment_number          │ UINTEGER    │ YES  │     │         │       │
│ segment_origin_id       │ VARCHAR     │ YES  │     │         │       │
│ segment_total_length    │ UINTEGER    │ YES  │     │         │       │
│ body                    │ BLOB        │ YES  │     │         │       │
└─────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘

🦺 Safety

This crate uses #![forbid(unsafe_code)] to ensure everything is implemented in 100% safe Rust.

👯 Contributing

We appreciate all kinds of contributions. Thank you!

warc-parquet's People

Contributors

dependabot[bot], maxcountryman

warc-parquet's Issues

thread 'main' panicked at 'range end out of bounds'

🐛 Description

  • What did you do?
    I attempted to convert a Common Crawl WARC file to a Parquet file for analysis in DuckDB. I downloaded the file and attempted to process it with warc-parquet, but the tool was unable to fully process the warc.gz file.

export RUST_BACKTRACE=full; cat CC-NEWS-20230803071746-00285.warc.gz | gzip -d | warc-parquet > CC-NEWS-20230803071746-00285.warc.gz.parquet

  • What did you expect to see?
    A new parquet file with a similar size to the original compressed warc.gz file, around 1 GiB.

  • What did you see instead?
    A new parquet file with only 6,476,466 bytes.


📎 Steps to Reproduce

wget https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/08/CC-NEWS-20230803071746-00285.warc.gz
# Length: 1072778533 (1023M) [binary/octet-stream]
# Saving to: ‘CC-NEWS-20230803071746-00285.warc.gz’
export RUST_BACKTRACE=full; cat CC-NEWS-20230803071746-00285.warc.gz | gzip -d | warc-parquet > CC-NEWS-20230803071746-00285.warc.gz.parquet
# thread 'main' panicked at 'range end out of bounds: 18446744071562881016 <= 6014243295', /home/niro/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bytes-1.4.0/src/bytes.rs:261:9
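The out-of-range index in the panic message sits just below 2^64. That pattern is what you get when a negative signed value is reinterpreted as an unsigned 64-bit index. The sketch below only checks that arithmetic on the two numbers from the panic message. Whether warc-parquet or one of its dependencies actually performs such a conversion is an assumption that this report does not confirm, and the function names are hypothetical.

```rust
/// Reinterprets the reported index as a signed 64-bit value.
fn as_signed(index: u64) -> i64 {
    index as i64
}

/// Returns the unsigned 32-bit value that, if stored in an i32 and then
/// sign-extended to 64 bits, would produce `index`. Returns None when
/// `index` does not fit that pattern.
fn u32_before_wraparound(index: u64) -> Option<u32> {
    let signed = index as i64;
    if signed >= i64::from(i32::MIN) && signed <= i64::from(i32::MAX) {
        Some(signed as i32 as u32)
    } else {
        None
    }
}

fn main() {
    // Both values are copied from the panic message above.
    let reported: u64 = 18_446_744_071_562_881_016;
    let buffer_len: u64 = 6_014_243_295;

    // Prints -2146670600.
    println!("index as i64: {}", as_signed(reported));

    // Prints Some(2148296696), a value just above i32::MAX (2147483647).
    println!("candidate u32: {:?}", u32_before_wraparound(reported));

    // The decompressed buffer is larger than a u32 can address.
    println!("buffer exceeds u32::MAX: {}", buffer_len > u64::from(u32::MAX));
}
```

If the hypothesis holds, an offset or length slightly above 2 GiB was narrowed to a signed 32-bit integer somewhere before being used as a slice bound, which would also explain why the output stops early on a multi-gigabyte input.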

🖥 System Information

  • Rust Version: rustc 1.72.0 (5680fa18f 2023-08-23)
  • Operating System and Version: Ubuntu 22.04.3 LTS
  • warc-parquet version: warc-parquet 0.4.0

📝 Additional Context

  • Logs, error messages, or panic messages: See attached backtrace.txt
    backtrace.txt
