
Web2Text

Source code for Web2Text: Deep Structured Boilerplate Removal, full paper at ECIR '18

Introduction

This repository contains

  • Scala code to parse an (X)HTML document into a DOM tree, convert it to a CDOM tree, interpret tree leaves as a sequence of text blocks and extract features for each of these blocks.

  • Python code to train and evaluate unary and pairwise CNNs on top of these features. Inference on the hidden Markov model based on the CNN output potentials can be executed using the provided implementation of the Viterbi algorithm.

  • The CleanEval dataset under src/main/resources/cleaneval/:

    • orig: raw pages
    • clean: reference clean pages
    • aligned: clean content aligned with the corresponding raw page on a per-character basis using the alignment algorithm described in our paper
  • Output from various other webpage cleaners on CleanEval under other_frameworks/output

Installation

  1. Install Scala and SBT. The code was tested with SBT 1.3.3. You can also use the Docker image hseeberger/scala-sbt:8u222_1.3.3_2.13.1.

    • If you have trouble installing Scala and SBT, you can run our Scala code in Docker with a command like:
    docker run -it --rm \
        --mount type=bind,source="$(pwd)",target=/root \
        hseeberger/scala-sbt:8u222_1.3.3_2.13.1 \
        sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures result/input.html result/step_1_extracted_features"
    
  2. Install Python 3 with TensorFlow and NumPy.
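    • A minimal way to get the Python dependencies (a sketch; the repository does not pin versions, and depending on the age of the code a TensorFlow 1.x release may be required):
    pip3 install numpy tensorflow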

Usage

See the blog post by Xavier Geerinck for step-by-step instructions on running this code.

Recipe: extracting text from a web page

  1. Run ch.ethz.dalab.web2text.ExtractPageFeatures through sbt. The arguments are:
    • input html file
    • the desired output base filename (the script produces {filename_base}_edge_features.csv and {filename_base}_block_features.csv)
  2. Use the Python script src/main/python/main.py with the 'classify' option. The arguments are:
    • python3 main.py classify {filename_base} {labels_out_filename}
  3. Use ch.ethz.dalab.web2text.ApplyLabelsToPage through sbt to produce clean text. Arguments:
    • input html file
    • {labels_out_filename} from step 2
    • output destination text file path
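Putting the recipe together, an end-to-end run could look like the following sketch. The file names and the result/ directory are placeholders, and it assumes main.py is invoked from src/main/python (where the CNN code lives):

sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures page.html result/extracted"
# step 1 writes result/extracted_block_features.csv and result/extracted_edge_features.csv

(cd src/main/python && python3 main.py classify ../../../result/extracted ../../../result/labels.csv)
# step 2 writes the predicted block labels to result/labels.csv

sbt "runMain ch.ethz.dalab.web2text.ApplyLabelsToPage page.html result/labels.csv result/clean.txt"
# step 3 writes the extracted clean text to result/clean.txt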

HTML to CDOM

In Scala:

import ch.ethz.dalab.web2text.cdom.CDOM
val cdom = CDOM.fromHTML("""
    <body>
        <h1>Header</h1>
        <p>Paragraph with an <i>Italic</i> section.</p>
    </body>
    """)
println(cdom)

Feature extraction

Example:

import ch.ethz.dalab.web2text.features.{FeatureExtractor, PageFeatures}
import ch.ethz.dalab.web2text.features.extractor._

val unaryExtractor = 
    DuplicateCountsExtractor
    + LeafBlockExtractor
    + AncestorExtractor(NodeBlockExtractor + TagExtractor(mode="node"), 1)
    + AncestorExtractor(NodeBlockExtractor, 2)
    + RootExtractor(NodeBlockExtractor)
    + TagExtractor(mode="leaf")

val pairwiseExtractor = 
    TreeDistanceExtractor + 
    BlockBreakExtractor + 
    CommonAncestorExtractor(NodeBlockExtractor)

val extractor = FeatureExtractor(unaryExtractor, pairwiseExtractor)

val features: PageFeatures = extractor(cdom)

println(features)

Aligning cleaned text with original source

import ch.ethz.dalab.web2text.alignment.Alignment
val reference = "keep this"
val source = "You should keep this text"
val alignment: String = Alignment.alignment(source, reference) 
println(alignment) // □□□□□□□□□□□keep this□□□□□

Extracting features for CleanEval

import ch.ethz.dalab.web2text.utilities.Util
import ch.ethz.dalab.web2text.cleaneval.CleanEval
import ch.ethz.dalab.web2text.output.CsvDatasetWriter

// fe is the FeatureExtractor built in the previous section, e.g.
val fe = extractor

val data = Util.time{ CleanEval.dataset(fe) }

// Write block_features.csv and edge_features.csv
// Format of a row: page id, groundtruth label (1/0), features ...
CsvDatasetWriter.write(data, "./src/main/python/data")

// Print the names of the exported features in order
println("# Block features")
fe.blockExtractor.labels.foreach(println)
println("# Edge features")
fe.edgeExtractor.labels.foreach(println)

Training the CNNs

Code related to the CNNs lives in the src/main/python directory.

To train the CNNs:

  1. Set the CHECKPOINT_DIR variable in main.py.
  2. Make sure the files block_features.csv and edge_features.csv are in the src/main/python/data directory. The example from the previous section generates them there.
  3. Convert the CSV files to .npy with data/convert_scala_csv.py.
  4. Train the unary CNN with python3 main.py train_unary.
  5. Train the pairwise CNN with python3 main.py train_edge.
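In shell form, the training pipeline might look like the following sketch (run from src/main/python; the exact arguments of data/convert_scala_csv.py are not documented here, so the bare invocation is an assumption):

cd src/main/python
python3 data/convert_scala_csv.py   # convert block_features.csv / edge_features.csv to .npy
python3 main.py train_unary         # train the unary CNN
python3 main.py train_edge          # train the pairwise CNN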

Evaluating the CNN

To evaluate the CNN:

  1. Set the CHECKPOINT_DIR variable in main.py to point to a directory with trained weights. We provide trained weights based on the CleanEval split and a custom web2text split (with more training data).
  2. Run python3 main.py test_structured to test performance on the CleanEval test set.

The performance of the other webpage cleaners is computed in Scala:

import ch.ethz.dalab.web2text.Main
Main.evaluateOthers()
