This project forked from hpi-information-systems/mondrian

Code repository for Mondrian, a project for multiregion template recognition in spreadsheets.

License: GNU General Public License v3.0

Python 100.00%

Detecting Multiregion Templates with Mondrian

Code repository of the Mondrian project, developed at the Information Systems Group of the Hasso Plattner Institute.

Setup

The code is written using Python 3.8. If using a local (or virtual) environment, install the dependencies with:

pip install -r requirements.txt

Alternatively, if using a conda distribution on Linux, create the environment with:

conda env create --file mondrian.yml

Basic script

The basic.py script detects region boundaries in two CSV files and compares their layout similarity. From the command line, type:

python3 basic.py

The script asks the user for the location of two comma-delimited files. Optional parameters include the hyperparameters for the multiregion detection phase (alpha, beta, gamma and the radius), a custom delimiter for the .csv files (default is comma), and an option to print images of the detected regions and graph layouts.

For the full list of parameters, type:

python3 basic.py --help

Multiregion detection experiment

The region_detection.py script can be used to run the experiments on the multiregion detection component of Mondrian. It requires the input dataset of csv files to be stored in a folder named "./res/{dataset-name}/csv" and the corresponding annotations in the folder "./res/{dataset-name}/annotations". The dataset name can be specified as a command line argument. The alpha, beta and gamma hyperparameters can be configured with command line parameters. An additional argument runs the evaluation and prints the results.
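Before launching an experiment, it can help to confirm that the dataset folders are where the script expects them. The following is a minimal sketch of such a check, not part of the Mondrian code base: the helper names `dataset_dirs` and `missing_dirs` are hypothetical, and only the folder layout is taken from the description above.

```python
from pathlib import Path


def dataset_dirs(dataset_name, root="./res"):
    """Return the (csv, annotations) folders expected for a dataset."""
    base = Path(root) / dataset_name
    return base / "csv", base / "annotations"


def missing_dirs(dataset_name, root="./res"):
    """Return the expected folders that do not exist on disk."""
    return [d for d in dataset_dirs(dataset_name, root) if not d.is_dir()]


if __name__ == "__main__":
    # "my-dataset" is a placeholder name, not a dataset shipped with the project.
    for folder in missing_dirs("my-dataset"):
        print(f"missing folder: {folder}")
```

An empty result from `missing_dirs` only means that both folders exist; it does not validate their contents.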

The script allows different configurations to be tested, selecting them with command line arguments. To run region detection using the connected component baseline and see the evaluation results, run:

python3 region_detection.py --baseline --evaluate
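For readers unfamiliar with the idea, a connected component approach groups adjacent non-empty cells into one region. The sketch below illustrates that general technique on a small in-memory CSV. It is not the code behind the --baseline option: the function name `connected_regions`, the use of 8-connectivity, and the rule that whitespace-only cells count as empty are all assumptions made for the example.

```python
import csv
import io
from collections import deque


def connected_regions(rows):
    """Group non-empty cells into 8-connected components.

    Returns one bounding box per component as ((top, left), (bottom, right)),
    using zero-based (row, column) indices.
    """
    # Assumption: a cell is "empty" if it contains only whitespace.
    cells = {
        (r, c)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if value.strip()
    }
    seen, boxes = set(), []
    for start in sorted(cells):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        top, left = start
        bottom, right = start
        while queue:
            r, c = queue.popleft()
            top, left = min(top, r), min(left, c)
            bottom, right = max(bottom, r), max(right, c)
            # Visit the eight neighbours of the current cell.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbour = (r + dr, c + dc)
                    if neighbour in cells and neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
        boxes.append(((top, left), (bottom, right)))
    return boxes


sample = "a,b,,\n1,2,,\n,,,\n,,,x\n"
rows = list(csv.reader(io.StringIO(sample)))
print(connected_regions(rows))
# [((0, 0), (1, 1)), ((3, 3), (3, 3))]
```

In the sample, the 2x2 block in the top-left corner and the isolated cell in the bottom-right corner are separated by an empty row and column, so they form two components.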

To test using a static radius R, run:

python3 region_detection.py --static R --evaluate

To test using a dynamic, optimal radius, run:

python3 region_detection.py --dynamic --evaluate

For the full list of parameters, type:

python3 region_detection.py --help

Template inference experiment

The template_recognition.py script can be used to run the experiments on the multiregion inference component of Mondrian.

Like region_detection.py, this script requires the target regions to be in a file named "annotations_elements.json" inside the annotation folder, as well as the gold standard templates in a file named "annotations_templates.json". Additionally, it requires the results produced by the multiregion detection script. The input files are read from the "res/{dataset-name}/csv" folder, and the output of the scripts is read from and written to the "result" folder. The dataset name can be specified as a command line argument. The script also allows setting the hyperparameters for the multiregion detection as well as the thresholds for region and template similarity.

Following the same names used in the region_detection.py script, the script must be run with an --experiment parameter to select the corresponding results computed by the multiregion detection stage. To run template inference using the connected component baseline, run:

python3 template_search.py --experiment baseline

To test using a static radius R, run:

python3 template_search.py --experiment static --r R --evaluate

To test using a dynamic, optimal radius, run:

python3 template_search.py --experiment dynamic

For the full list of parameters, type:

python3 template_search.py --help

Datasets

The "res" folder contains two annotated datasets of spreadsheet files: DECO and FUSTE. Their respective folders contain the files in .csv format and the annotations in .json format.

The file "annotation_elements.json" contains the file-level annotations for region boundaries in the form:

{
    "file1": {
        "n_regions": int,
        "regions": [
            {
                "region_label": string,
                "region_type": string,
                "top_lx": [int,int], # the top-left coordinate of the region boundary
                "bot_rx": [int,int], # the bottom-right coordinate of the region boundary
                ...
            }
        ]
    },
    "file2": ...
}
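Annotations in this shape can be read with Python's standard json module. The following is a minimal sketch under stated assumptions: the helper names `iter_regions` and `count_mismatches` are hypothetical, the sample values (including the region types "table" and "notes") are made up for illustration, and the coordinate pairs are passed through unchanged because their order is not specified here.

```python
import json


def iter_regions(annotations):
    """Yield (file_name, label, region_type, top_lx, bot_rx) for each region."""
    for file_name, entry in annotations.items():
        for region in entry["regions"]:
            yield (
                file_name,
                region["region_label"],
                region["region_type"],
                tuple(region["top_lx"]),
                tuple(region["bot_rx"]),
            )


def count_mismatches(annotations):
    """Return the files whose "n_regions" differs from the length of "regions"."""
    return [
        name
        for name, entry in annotations.items()
        if entry["n_regions"] != len(entry["regions"])
    ]


# Made-up sample in the documented shape; a real file would be read with json.load().
sample = json.loads("""
{
  "file1": {
    "n_regions": 2,
    "regions": [
      {"region_label": "r0", "region_type": "table", "top_lx": [0, 0], "bot_rx": [3, 5]},
      {"region_label": "r1", "region_type": "notes", "top_lx": [0, 7], "bot_rx": [1, 7]}
    ]
  }
}
""")

for row in iter_regions(sample):
    print(row)
print("files with inconsistent n_regions:", count_mismatches(sample))
```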

The file "annotations_template.json" contains dataset-level annotations of templates in the form:

{ "template_name": ["file1", "file2", ...] }
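When evaluating per-file results, it is often convenient to look up the template of a given file rather than the files of a given template. The sketch below inverts the mapping. The function name `templates_by_file` is hypothetical, the sample data is made up, and the function keeps a list of templates per file because the format alone does not say whether a file can appear under more than one template.

```python
from collections import defaultdict


def templates_by_file(templates):
    """Invert {"template": [files]} into {"file": [templates]}."""
    index = defaultdict(list)
    for template_name, files in templates.items():
        for file_name in files:
            index[file_name].append(template_name)
    return dict(index)


# Made-up sample in the documented shape.
sample_templates = {
    "template_a": ["file1", "file2"],
    "template_b": ["file3"],
}
print(templates_by_file(sample_templates))
# {'file1': ['template_a'], 'file2': ['template_a'], 'file3': ['template_b']}
```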

Contributors

vitaglianog
