WikiText Table Parser

A WikiText table parser written in Rust.

We also have a binding for Python.

What is this project for?

WikiText is the markup format used by Wikipedia. Most available wiki datasets and processing tools ignore table data. This project implements a table parser that helps you process tables in WikiText or in a wiki dump.

A table in WikiText looks like this:

{| class="wikitable"
|+ Caption text
|-
! Header text !! Header text !! Header text
|-
| Example || Example || Example
|-
| Example || Example || Example
|-
| Example || Example || Example
|}

See also the WikiText table reference for more detail.
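
To make the markup above concrete, here is a minimal Python sketch that splits a simple table like the example into a caption and rows of cells. It does not use wikitext_table_parser and is not how the library works internally; the function name `split_simple_table` is made up for illustration. The sketch assumes one row per line and no cell attributes, multi-line cells, or nested tables, which are the cases where a real parser is needed.

```python
SAMPLE_TABLE = """{| class="wikitable"
|+ Caption text
|-
! Header text !! Header text !! Header text
|-
| Example || Example || Example
|-
| Example || Example || Example
|-
| Example || Example || Example
|}"""


def split_simple_table(wikitext):
    """Naively split a simple wikitext table into (caption, rows).

    Illustration only: assumes one row per line and no cell attributes.
    """
    caption = None
    rows = []
    current = None
    for raw_line in wikitext.splitlines():
        line = raw_line.strip()
        if line.startswith("{|") or line.startswith("|}"):
            continue  # table open/close markers carry no cell data
        if line.startswith("|+"):
            caption = line[2:].strip()
        elif line.startswith("|-"):
            # row separator: start collecting a new row
            current = []
            rows.append(current)
        elif line.startswith("!") or line.startswith("|"):
            # header cells are separated by "!!", data cells by "||"
            sep = "!!" if line.startswith("!") else "||"
            if current is None:
                current = []
                rows.append(current)
            current.extend(cell.strip() for cell in line[1:].split(sep))
    # drop rows that never received any cells
    return caption, [row for row in rows if row]


caption, rows = split_simple_table(SAMPLE_TABLE)
print(caption)
for row in rows:
    print(row)
```

Real tables on Wikipedia routinely break these assumptions (style attributes before a cell, cells spanning several lines, templates inside cells), so this is only a way to read the syntax, not a substitute for the parser.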

Documentation

Rust

Installation

[dependencies]
wikitext_table_parser = "0.3.0"

Usage Example

use std::env;
use std::fs::File;
use std::io::Read;
use wikitext_table_parser::parser::{Event, WikitextTableParser};
use wikitext_table_parser::tokenizer::{
    get_all_cell_text_special_tokens, get_all_table_special_tokens, Tokenizer,
};

fn main() {
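    // Take the input path from the first command-line argument
    let args: Vec<String> = env::args().collect();
    let file_path = match args.get(1) {
        Some(path) => path,
        None => {
            eprintln!("Usage: <program> <wikitext-file>");
            return;
        }
    };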

    // Attempt to open the file
    let mut file = match File::open(file_path) {
        Ok(file) => file,
        Err(_) => {
            eprintln!("Error opening the file.");
            return;
        }
    };

    // Read the contents of the file into a String
    let mut content = String::new();
    if file.read_to_string(&mut content).is_err() {
        eprintln!("Error reading the file into a string.");
        return;
    }

    // Build one tokenizer for table-level markup and one for cell text
    let table_tokenizer = Tokenizer::build(get_all_table_special_tokens());
    let cell_tokenizer = Tokenizer::build(get_all_cell_text_special_tokens());
    let wikitext_table_parser =
        WikitextTableParser::new(table_tokenizer, cell_tokenizer, &content, true);

    // Iterate over the parse events and print the ones of interest
    for event in wikitext_table_parser {
        match event {
            Event::TableStart => {
                println!("Table START!");
            }
            Event::TableStyle(table_style) => {
                println!("table style{:?}#", table_style);
            }
            Event::TableCaption(text) => {
                println!("table name{:?}#", text);
            }
            Event::RowStyle(row_style) => {
                println!("----- {:?} -----", row_style);
            }
            Event::ColStyle(col_style) => {
                print!("col style: {:?} -- ", col_style);
            }
            Event::ColEnd(text) => {
                println!("col data: {:?}", text);
            }
            Event::TableEnd => {
                println!("Table END!");
            }
            _ => {}
        }
    }
}

Python

Installation

  1. Download the wheel file from the project's releases

  2. Install the wheel

pip install wikitext_table_parser-xxx.whl

Usage Example

import sys
from wikitext_table_parser import (
    WikitextTableParser,
    Tokenizer,
    Event,
    get_all_table_special_tokens,
    get_all_cell_text_special_tokens
)

table_tokens = get_all_table_special_tokens()
cell_tokens = get_all_cell_text_special_tokens()

table_tokenizer = Tokenizer(table_tokens)
cell_tokenizer = Tokenizer(cell_tokens)

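# Read the wikitext file given as the last command-line argument
with open(sys.argv[-1], encoding="utf-8") as f:
    test_case = f.read()

parser = WikitextTableParser(table_tokenizer, cell_tokenizer, test_case, True)
print(parser.tokens)

# Step the parser until no tokens remain
while len(parser.tokens) > 0:
    parser.step()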

for event in parser.event_log_queue:
    if isinstance(event, Event.TableStart):
        pass
    elif isinstance(event, Event.TableStyle):
        print("table style:", event.text)
    elif isinstance(event, Event.TableEnd):
        pass
    elif isinstance(event, Event.ColStart):
        print("col type:", event.cell_type)
    elif isinstance(event, Event.ColStyle):
        print("col style:", event.text)
    elif isinstance(event, Event.ColEnd):
        print("col data:", event.text)
    elif isinstance(event, Event.TableCaptionStart):
        pass
    elif isinstance(event, Event.TableCaption):
        print("table caption:", event.text)
    elif isinstance(event, Event.RowStart):
        pass
    elif isinstance(event, Event.RowStyle):
        print("row style:", event.text)
    elif isinstance(event, Event.RowEnd):
        print("-"*20)
    else:
        raise NotImplementedError(event)
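
Printing events is useful for debugging, but most callers want rows of cell text. The sketch below shows one way to fold an event stream into rows. So that it runs without the binding installed, events are represented as hypothetical `(kind, text)` tuples whose names mirror the event classes in the example above. The helper `events_to_rows` is not part of the library, and the event order it relies on (each row's `ColEnd` events followed by a `RowEnd`) is an assumption to check against real parser output.

```python
def events_to_rows(events):
    """Group (kind, text) event pairs into a list of rows of cell text.

    Stand-in representation: with the real binding, the string comparisons
    would become isinstance checks against Event.ColEnd and Event.RowEnd.
    """
    rows = []
    current = []
    for kind, text in events:
        if kind == "ColEnd":
            current.append(text)  # one finished cell
        elif kind == "RowEnd":
            rows.append(current)  # close the current row
            current = []
    if current:
        # keep cells from a trailing row that never emitted RowEnd
        rows.append(current)
    return rows


# Hand-written event stream for illustration, not captured parser output
sample_events = [
    ("TableStart", None),
    ("RowStart", None),
    ("ColEnd", "Header text"),
    ("ColEnd", "Header text"),
    ("RowEnd", None),
    ("RowStart", None),
    ("ColEnd", "Example"),
    ("ColEnd", "Example"),
    ("RowEnd", None),
    ("TableEnd", None),
]

table_rows = events_to_rows(sample_events)
print(table_rows)
```

Events the helper does not care about (styles, captions, start markers) are simply skipped, so the same loop keeps working if more event kinds appear in the stream.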

Contributors

p208p2002
