Hey guys, I am trying to replicate the results of a paper


warcio.exceptions.ArchiveLoadFailed: Unknown archive format (warcio issue, closed)

KyloPrem commented on June 20, 2024

warcio.exceptions.ArchiveLoadFailed: Unknown archive format


Comments (3)

wumpus commented on June 20, 2024

@KyloPrem so just to be clear, you closed this because the actual issue is receiving a 403 rate limit result, and not checking the http status. We did make a change that causes this recently, and honestly, I didn't even think of that when I read your bug report! Glad you figured it out.


wumpus commented on June 20, 2024

From the first line printed in the exception, you can see that the file you're asking ArchiveIterator to work on is xml.

ArchiveIterator iterates over warc files. Not xml.
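One quick way to see what was actually downloaded is to look at the first bytes before passing them to ArchiveIterator. The sketch below is only a heuristic: the function name `describe_payload` and the labels it returns are made up for illustration, and it does not try to recognize ARC files.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip member (.warc.gz records)


def describe_payload(payload: bytes) -> str:
    """Guess what a downloaded byte string contains (hypothetical helper)."""
    if payload.startswith(GZIP_MAGIC):
        return "gzip"
    head = payload.lstrip()[:16]
    if head.startswith(b"WARC/"):
        # Uncompressed WARC records begin with a version line such as WARC/1.0
        return "warc"
    if head.startswith(b"<"):
        # XML or HTML, for example an error document returned by the server
        return "xml-or-html"
    return "unknown"
```

If this reports `xml-or-html` for `resp.content`, printing the body will usually show the server's error message rather than archive data.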


KyloPrem commented on June 20, 2024

That is interesting. The data .tsv definitely references the common crawl archives.
https://commoncrawl.org/2016/07/
crawl-data/CC-MAIN-2016-07/segments/1454702018134.95/warc/CC-MAIN-20160205195338-00121-ip-10-236-182-209.ec2.internal.warc.gz

I tried to build a minimal example below, which gives me the exception above:

import io
import time
import justext
import argparse
import requests
import pandas as pd
from tqdm import tqdm
from warcio.archiveiterator import ArchiveIterator


def download_debug():

    common_crawl_data = {"filename":"crawl-data/CC-MAIN-2016-07/segments/1454702018134.95/warc/CC-MAIN-20160205195338-00121-ip-10-236-182-209.ec2.internal.warc.gz",
                         "offset":244189209,
                         "length":989
                         }

    offset, length = int(common_crawl_data['offset']), int(common_crawl_data['length'])
    offset_end = offset + length - 1

    prefix = 'https://commoncrawl.s3.amazonaws.com/'

    resp = requests.get(prefix + common_crawl_data['filename'], headers={'Range': 'bytes={}-{}'.format(offset, offset_end)})
    raw_data = io.BytesIO(resp.content)

    uri = None
    page = None
    
    for record in ArchiveIterator(raw_data, arc2warc=True):
        uri = record.rec_headers.get_header('WARC-Target-URI')
        R = record.content_stream().read()
        try:
            page = R.strip().decode('utf-8')
        except UnicodeDecodeError:
            page = R.strip().decode('latin1')
        print(uri, page)
    return uri, page

download_debug()

Any recommendation on how to debug this, or an idea why this request would return XML instead of the WARC file?

EDIT: the request returns a 403 denied message.
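For anyone hitting the same thing, here is a minimal sketch of checking the HTTP status before the bytes reach ArchiveIterator. It assumes the server honors the Range header and answers 206 Partial Content on success. The names `range_header`, `body_or_raise` and `RangeRequestFailed` are made up for this example and are not part of warcio or requests.

```python
import io


class RangeRequestFailed(RuntimeError):
    """Raised when a byte-range request did not return partial content."""


def range_header(offset: int, length: int) -> dict:
    # HTTP byte ranges are inclusive, so the last byte is offset + length - 1.
    return {"Range": "bytes={}-{}".format(offset, offset + length - 1)}


def body_or_raise(status_code: int, body: bytes) -> io.BytesIO:
    """Wrap the body for ArchiveIterator, or fail loudly on a non-206 status."""
    if status_code != 206:
        # Include the start of the body: for a 403 it holds the error document.
        preview = body[:200].decode("utf-8", errors="replace")
        raise RangeRequestFailed(
            "expected HTTP 206, got {}: {}".format(status_code, preview)
        )
    return io.BytesIO(body)


# Intended use with the example above (not executed here):
#   resp = requests.get(url, headers=range_header(offset, length))
#   raw_data = body_or_raise(resp.status_code, resp.content)
```

With this in place, a 403 surfaces as an error that names the status code, instead of as "Unknown archive format" from warcio.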
