ReadingBank2Paragraph

This is a byproduct of a university project. The project itself was about document analysis based on a document's layout.

How it works

The ReadingBank dataset consists of boxes that represent words. These boxes are first clustered into lines (rows). The rows are then clustered into paragraphs using agglomerative clustering.

To find the closest boxes (which represent lines), a customized distance based on the Manhattan distance is used. It computes the shortest distance between two boxes. If the boxes overlap in one dimension (x- or y-axis), the distance is 0.
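The distance described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the box layout `(x0, y0, x1, y1)` and the names `axis_gap`, `box_gaps` and `box_distance` are assumptions, and the sketch reads "the distance is 0" as applying to the axis on which the boxes overlap.

```python
from typing import Tuple

# Assumed box layout: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def axis_gap(a_min: float, a_max: float, b_min: float, b_max: float) -> float:
    """Gap between two intervals on one axis; 0 if they overlap or touch."""
    return max(0.0, max(a_min, b_min) - min(a_max, b_max))


def box_gaps(a: Box, b: Box) -> Tuple[float, float]:
    """Return the (x, y) gaps between two boxes."""
    dx = axis_gap(a[0], a[2], b[0], b[2])
    dy = axis_gap(a[1], a[3], b[1], b[3])
    return dx, dy


def box_distance(a: Box, b: Box) -> float:
    """Manhattan-style distance: the sum of the per-axis gaps."""
    dx, dy = box_gaps(a, b)
    return dx + dy
```

Two lines stacked directly on top of each other overlap on the x-axis, so only their vertical gap contributes to the distance.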

Since multiple lines (horizontally and vertically) can form a single paragraph, two thresholds are used: DISTANCE_THRESHOLD and CLUSTER_THRESHOLD. You can find and change them in the config.py file.
DISTANCE_THRESHOLD is the threshold for the distance in the x-axis direction, while CLUSTER_THRESHOLD is the threshold for the distance in the y-axis direction.
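One way the two thresholds could interact is sketched below. Everything here is an assumption made for illustration: the threshold values, the box layout `(x0, y0, x1, y1)`, the helper names, and the merge rule (two lines are joined when both per-axis gaps are within their thresholds). The sketch uses a union-find merge, which yields the same groups as single-linkage agglomerative clustering cut at these thresholds; the repository may use a different linkage.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)

# Hypothetical values; the real ones live in config.py.
DISTANCE_THRESHOLD = 20.0  # maximum allowed gap along the x-axis
CLUSTER_THRESHOLD = 10.0   # maximum allowed gap along the y-axis


def gap(a_min: float, a_max: float, b_min: float, b_max: float) -> float:
    """Gap between two intervals on one axis; 0 if they overlap."""
    return max(0.0, max(a_min, b_min) - min(a_max, b_max))


def cluster_lines(lines: List[Box]) -> List[int]:
    """Assign a paragraph label to each line box."""
    parent = list(range(len(lines)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            a, b = lines[i], lines[j]
            dx = gap(a[0], a[2], b[0], b[2])
            dy = gap(a[1], a[3], b[1], b[3])
            if dx <= DISTANCE_THRESHOLD and dy <= CLUSTER_THRESHOLD:
                parent[find(j)] = find(i)  # merge the two groups

    # Relabel roots as consecutive integers in order of first appearance.
    labels: Dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(lines))]
```

With these hypothetical values, two lines separated by a vertical gap of 5 end up in the same paragraph, while a line 35 units further down starts a new one.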

How to use this code

Installation

  1. pip install -r requirements.txt
  2. cd src/
  3. python example.py

Usage

The intended usage is shown in example.py. The relevant lines are:

clustered_document = process(readingbank_document)
visualize_readingbank_document(
    doc=clustered_document,
    paragraphs_affiliation=clustered_document["paragraph_class_per_line"],
)

This produces a plot that visualizes the clustering of a ReadingBank document into lines/rows and paragraphs.

This repository is intended for documents given in the ReadingBank format. This includes only the layout information. If you want to use an arbitrary document, make sure it is a JSON object containing the keys src and tgt.
Examples are given in the example.jsonl file.
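Before passing your own documents to the code, it may help to check that each one has the required keys. The following is a minimal sketch using only the standard library; the function name `parse_documents` is hypothetical, and the sketch checks only that the keys src and tgt are present, not what they contain.

```python
import json
from typing import Dict, Iterable, List

REQUIRED_KEYS = ("src", "tgt")


def parse_documents(jsonl_lines: Iterable[str]) -> List[Dict]:
    """Parse JSON Lines input and check each object for the required keys."""
    docs = []
    for number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        if not isinstance(doc, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in doc]
        if missing:
            raise ValueError(f"line {number}: missing keys {missing}")
        docs.append(doc)
    return docs
```

Because an open file is an iterable of lines, a file handle for example.jsonl can be passed directly, for example `parse_documents(open("example.jsonl", encoding="utf-8"))`.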

Questions?

Feel free to open an issue if you need guidance or want to discuss things.

Contributors

blos