WTabHTML: HTML Wikitables extractor

Input:

  • Wikipedia HTML dump
  • Language

Output:

File format: JSON Lines. Each line is a JSON object with the following fields:

{
    title: Wikipedia page title
    wikidata: Wikidata ID
    url: URL of the Wikipedia page
    index: index of the table within the Wikipedia page
    html: HTML content of the table
    caption: table caption
    aspects: hierarchy of Wikipedia sections
}
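
As a minimal sketch of how the output described above might be consumed, the following reads a bz2-compressed JSON Lines dump one record at a time. The helper name iter_tables and its limit parameter are hypothetical and not part of wtabhtml; the only assumption about the data is the one stated above, namely one JSON object per line.

```python
import bz2
import json
from typing import Iterator, Optional


def iter_tables(path: str, limit: Optional[int] = None) -> Iterator[dict]:
    """Yield one table record (a dict) per non-empty line of a .jsonl.bz2 dump.

    Hypothetical helper, not part of wtabhtml. `limit` stops after that many
    records; None reads the whole file.
    """
    count = 0
    # Text mode lets bz2 decompress and decode line by line, so the whole
    # dump never has to fit in memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            if limit is not None and count >= limit:
                break
            yield json.loads(line)
            count += 1


# Example usage (the path is taken from the pipeline commands in this README):
# for table in iter_tables("./data/models/cr.jsonl.bz2", limit=1):
#     print(table["title"], table["index"], table["caption"])
```

Streaming the file keeps memory use flat even for large language editions, at the cost of not supporting random access to a given table.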

Usage:

Download, extract, and dump wikitables for the English (en) Wikipedia:

python wtabhtml.py download -l en

Download, extract, and dump wikitables, then generate table images, for the cr Wikipedia:

python wtabhtml.py gen-images -l cr -n 3

Note: To generate table images faster, download our preprocessed dumps and copy all {LANGUAGE}.jsonl.bz2 files (the wikitables dumps in PubTabNet format) to wtabhtml/data/models/wikitables_html_pubtabnet.

To re-run the whole pipeline instead, the tool downloads the Wikipedia HTML dump, extracts the wikitables, and writes them to wtabhtml/data/models/wikitables_html_pubtabnet/{LANGUAGE}.jsonl.bz2, following the pipeline below.

Wikitable processing pipeline for the cr language:

# Download dump
python wtabhtml.py download -l cr
# Parse dump and save json file
python wtabhtml.py parse -l cr
# Read dump
python wtabhtml.py read -l 1 -i ./data/models/cr.jsonl.bz2
# Generate images
python wtabhtml.py gen-images -l cr -n 3
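
Once the pipeline has produced a dump, the html field of a record can be turned into rows of cell text. The following is a rough sketch using only the Python standard library; the class TableRows and the function table_to_rows are hypothetical names and not part of wtabhtml. It assumes a single, non-nested table and ignores rowspan and colspan, so it is not a substitute for a full table parser.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper: nested tables, rowspan, and colspan are not handled.
    """

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None   # cells of the row being read
        self._cell: Optional[List[str]] = None  # text chunks of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed by malformed HTML


def table_to_rows(html: str) -> List[List[str]]:
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For example, table_to_rows(record["html"]) on a record loaded from the dump gives a list of lists that can be written to CSV or inspected directly.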

Contact

Phuc Nguyen ([email protected])
