cboulanger / excite-docker

Docker image with tools for the annotation of ML training docs for reference extraction based on the EXparser tools

Home Page: https://cboulanger.github.io/excite-docker

License: GNU General Public License v3.0

Languages: Dockerfile 0.07%, Python 22.59%, Shell 0.87%, HTML 3.96%, JavaScript 66.96%, CSS 5.55%
Topics: citation-mining, exparser

excite-docker's Introduction

EXcite-Docker: Tool for the annotation of training material for ML-based reference extraction and segmentation

Note: This repository is no longer maintained. Current work has shifted to https://github.com/cboulanger/anystyle-workflow

This is a docker image that provides a web application to produce training material for two ML-based reference extraction & segmentation engines:

  • EXcite/EXparser
  • AnyStyle (https://github.com/inukshuk/anystyle)

Both serve to extract citation data from PDF documents.

The image provides a web UI for producing the training material needed to improve citation recognition for particular corpora of scholarly literature where the current models do not perform well. It also provides a CLI to run the EXcite commands, manage multiple sets of model training data and model data, and support an evaluation workflow that measures the performance of a model. The AnyStyle toolkit has its own CLI and evaluation built in; currently, support for AnyStyle is limited to editing .ttx documents, and more comprehensive integration will follow.

The code has been forked from https://git.gesis.org/hosseiam/excite-docker, but there is little of the original code left except the core EXparser algorithm.

A demo of the web frontend (without backend functionality) is available here.

Installation

  1. Install Docker
  2. Clone this repo with: git clone https://github.com/cboulanger/excite-docker.git && cd excite-docker
  3. Build docker image: ./bin/build
  4. If you want to use AnyStyle, please consult its GitHub page on how to install it: https://github.com/inukshuk/anystyle

Use of the web frontend

  1. Run server: ./bin/start-servers
  2. Open frontend at http://127.0.0.1:8000/web/index.html
  3. Click on "Help" for instructions (also lets you download the Zotero add-ons)

CLI

You can control the extraction and segmentation process via the CLI. CLI commands are executed with ./bin/run <command>. Available commands can be listed with ./bin/run --help, and you can always get detailed help on each command with ./bin/run <command> [<subcommand>] --help.

The main commands for extracting references from PDFs are:

  • layout: run layout analysis of any PDF file in Data/1-pdfs and put the result into Data/2-layout
  • exparser: process all the files in Data/2-layout. The output will be provided in CSV (plain text), XML and BibTeX format in the directories Data/3-refs, Data/3-refs_seg and Data/3-refs_bibtex
  • segmentation: process the references in the CSV files in the Data/3-refs directory and output XML and BibTeX files to Data/3-refs_seg and Data/3-refs_bibtex

For more CLI commands, see the sections below.
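The three-step pipeline above can be sketched as follows. This helper is illustrative, not part of the toolchain; only the command names (layout, exparser, segmentation) and the ./bin/run entry point are taken from this README:

```python
import subprocess

def pipeline_commands():
    """Return the CLI invocations, in order, for a full extraction run:
    layout analysis, reference extraction, then reference segmentation."""
    return [["./bin/run", step] for step in ("layout", "exparser", "segmentation")]

def run_pipeline(dry_run=True):
    """Execute the pipeline; with dry_run=True just return the commands."""
    cmds = pipeline_commands()
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return cmds
```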

Training a new model

If you want to use this feature, you need to have git-lfs installed before you check out this repository. git-lfs is necessary to download the large files that are used during training.

In order to train a new model from scratch, you need to do the following:

  1. Run ./bin/run model create <model_name>
  2. Put the PDFs with which you are going to train the model into Data/1-pdfs if they are native PDFs or contain an OCR layer. If the PDFs consist of scanned pages without an OCR layer, put them into Data/0-pdfs_no_ocr and wait for the OCR server to process them and move them to Data/1-pdfs
  3. Create the layout files with ./bin/run layout
  4. Move files from Data/2-layout into Dataset/<model_name>/LYT
  5. Load the web application and choose your new model from the "Model" dropdown
  6. Use the web application to load and annotate the layout files from Dataset/<model_name>/LYT in the identification and segmentation views. Here is more information on training the reference extraction model and the reference parsing model.
  7. "Save" the training files after each annotation, they will be stored in the model directory
  8. On the command line, run ./bin/train <model_name>. If you want to train extraction, segmentation and model completeness separately, use ./bin/run train extraction <model_name>, ./bin/run train segmentation <model_name>, or ./bin/run train completeness <model_name>

Training data lives in the Dataset/<model_name> folder. For details, see here.

For training, you need to populate the following folders with training data:

Dataset/<model_name>/LYT/ - layout files
Dataset/<model_name>/LRT/ - layout files with annotation for references <ref>
Dataset/<model_name>/SEG/ - segmentation data for citations 
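As a sketch, a pre-flight check for these folders could look like this. The folder names (LYT, LRT, SEG) come from this README; the checking helper itself is hypothetical and not part of the toolchain:

```python
from pathlib import Path

def missing_training_folders(model_name, base="Dataset"):
    """Return the names of absent or empty training folders (LYT, LRT, SEG)
    for a model, so problems surface before ./bin/training is run."""
    missing = []
    for sub in ("LYT", "LRT", "SEG"):
        d = Path(base) / model_name / sub
        if not d.is_dir() or not any(d.iterdir()):
            missing.append(sub)
    return missing
```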

To run the training, execute ./bin/training <model_name>.

This will generate data in the following folders:

#feature extraction output
Dataset/<model_name>/Features/
Dataset/<model_name>/RefLD/

#model training output
Models/<model_name>/SMN.npy
Models/<model_name>/FSN.npy
Models/<model_name>/rf.pkl - the model

You can list all existing models with bin/run model list and delete a model with bin/run model delete <model_name>.

WebDAV-based model package repository

You can store model and training data on a WebDAV server, which is particularly useful for sharing data and collaborative training. To enable this, rename .env.dist to .env and configure the required environment variables.

The available CLI commands can be listed with bin/run package --help. To upload training or model data to the WebDAV server, you can use the package publish command, which has the following syntax:

bin/run package publish --help
usage: package publish [-h] [--model-name MODEL_NAME] [--trained-model]
                       [--training-data {extraction,segmentation,all}]
                       [--overwrite]
                       package_name

positional arguments:
  package_name          Name of the package in which to publish the model data

optional arguments:
  -h, --help            show this help message and exit
  --model-name MODEL_NAME, -n MODEL_NAME
                        Name of the model from which to publish data. If not
                        given, the name of the package is used.
  --trained-model, -m   Include the trained model itself
  --training-data {extraction,segmentation,all}, -t {extraction,segmentation,all}
                        The type of training data to include in the package
  --overwrite, -o       Overwrite an existing package

The package_name is an arbitrary string which should express the content of the package, ideally plus a timestamp, such as foo-segm-train-data-20220502 or foo-model-data-20220502_075523. You can choose to upload training data with the --training-data option, which takes either "extraction", "segmentation", or "all" for both. To share the trained model itself, use --trained-model. Since the model files are large, this adds significantly to the size of the package and to the time it takes to upload and download the package data. On the other hand, it saves the time needed to first train the model from the training data.
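A timestamped package name as described above can be generated like this (the helper is hypothetical; the CLI accepts any string as package_name):

```python
from datetime import datetime

def make_package_name(model, kind, stamp=None):
    """Build a package name following the README's convention of a content
    description plus a timestamp, e.g. foo-model-data-20220502_075523."""
    if stamp is None:
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{model}-{kind}-{stamp}"
```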

You can later run bin/run package import <package_name> to import the package contents into a model with the same name, which is created if it does not exist. If you want to import the package contents into a different model, specify its name with the --model-name option.

Display the list of remotely stored packages with bin/run package list and delete a package with bin/run package delete <package_name>.

Evaluating the performance of a model

To measure the accuracy of a model, we support the following split - train - eval workflow via scripts that use the CLI commands.

  1. bin/split foo foo_split: The training data of a model "foo" is split into 80% training data and 20% evaluation data and moved into a newly created model "foo_split"

  2. bin/train foo_split: The model is trained with its training data

  3. bin/eval foo_split: Extraction and segmentation are run on the evaluation data and the result is evaluated against the known gold standard.

  4. bin/run report foo_split prints the accuracy data to the console (it can also output it to a csv file)

This workflow can be further automated with the bin/split_train_eval <model_name> script, which runs these commands in sequence.
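The 80/20 split in step 1 can be sketched as follows; this is an illustration of the idea, and the actual logic of bin/split may differ:

```python
import random

def split_training_data(files, train_frac=0.8, seed=0):
    """Shuffle the training files and cut them into a training set and an
    evaluation set (illustrative sketch of what bin/split does)."""
    files = list(files)
    random.Random(seed).shuffle(files)  # deterministic shuffle for reproducibility
    cut = round(len(files) * train_frac)
    return files[:cut], files[cut:]
```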

To compare the performance of two models, you can use the bin/compare <model1> <model2> command, which automatically creates split copies of the models and adds a third model that combines the training data of both.

Use different versions of the EXparser engine

To be able to compare the performance of different versions of the main EXparser extraction and segmentation engine, the engine can be dynamically switched (since v0.2.0). You can install an engine version with bin/run engine install <version> and use it with bin/run engine use <version>. A list of installed engines is available with bin/run engine list. Any commit tag on GitHub can be used as a version (including branches and PRs), as can the released versions listed at https://github.com/cboulanger/excite-docker/releases (except version 0.1.0, which is not compatible).

Zotero support

Zotero integration is currently not supported because the plugin providing the required API is not compatible with Zotero 6. A native API is planned by the Zotero Devs.

If a Zotero instance with the appropriate add-ons is running, the webapp will enable additional commands that let you retrieve the PDF attachment(s) of the currently selected item/collection, extract references from them, and store them with the citing item.

If the Zotero storage folder is not located in ~/Zotero/storage, you need to rename .env.dist to .env and, in this file, set the ZOTERO_STORAGE_PATH environment variable to the path of this directory.
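A minimal sketch of the relevant .env entry, assuming the standard KEY=value .env format (the path below is an example placeholder, not a real default):

```
# .env (renamed from .env.dist) -- example only; adjust to your storage folder
ZOTERO_STORAGE_PATH=/home/user/Zotero/storage
```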

excite-docker's People

Contributors

cboulanger, iurshina


excite-docker's Issues

Fail more gracefully if font file is missing

  File "/app/run-main.py", line 171, in <module>
    call_extraction_training(sys.argv[2])
  File "/app/run-main.py", line 119, in call_extraction_training
    os.path.join(model_dir, get_version(), model_name))
  File "/app/EXparser/Training_Ext.py", line 52, in train_extraction
    row2 = reader2[uu]
IndexError: list index out of range

This error occurs if old training material is used which doesn't have the font columns.
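A more graceful handling could stop at the shorter of the two row lists instead of indexing past its end. This is an illustrative sketch of the idea, not the project's actual code (the parameter names are hypothetical):

```python
def aligned_rows(text_rows, font_rows):
    """Iterate text rows and font-feature rows pairwise, stopping at the
    shorter list instead of raising IndexError when old training material
    lacks the font columns."""
    if len(text_rows) != len(font_rows):
        # real code should log a warning naming the offending training file
        print(f"warning: {len(text_rows)} text rows vs {len(font_rows)} font rows; "
              "extra rows will be ignored")
    return list(zip(text_rows, font_rows))  # zip() truncates to the shorter list
```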

Auto-segmentation deleting first reference

Dear contributors, I noticed an issue while comparing the displayed lists of references before and after using the Auto-segmentation. The aforementioned function seems to regularly delete the first reference of the extracted list.

RuntimeWarnings during Feature Extraction

/app/EXparser/src/gle_fun_ext.py:139: RuntimeWarning: invalid value encountered in true_divide
lh2 = 1.0 * lh / sum(lh)
/app/EXparser/src/gle_fun_ext.py:141: RuntimeWarning: invalid value encountered in true_divide
lh = 1.0 * np.array(lh) / sum(lh)
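The warning is raised when sum(lh) is 0, making the division produce NaNs. A guarded normalization, as an illustrative fix sketch (not the project's code), could look like this:

```python
import numpy as np

def safe_normalize(lh):
    """Normalize a feature histogram, returning zeros when the sum is 0
    instead of triggering the 'invalid value encountered in true_divide'
    RuntimeWarning from a 0/0 division."""
    lh = np.asarray(lh, dtype=float)
    total = lh.sum()
    if total == 0.0:
        return np.zeros_like(lh)
    return lh / total
```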

Allow switching of models with optional remote model repository

To be able to use specialized models for different kinds of scholarly citation patterns, we should make the directory containing model data (now EXparser/Utils) configurable. The idea is to give such a specialized model a unique name which serves both as a well-known id and as the name of the directory in which the models are stored. Since the model data is directly dependent on the training code, it needs to be versioned. This also allows running tests that compare the performance of a particular model with the same id but different versions (for example, an evaluation comparing the performance of different git branches).

  • Models are stored in EXparser/Models/<version>/<model_id>. EXparser/Utils/ is renamed to EXparser/Models/<version>/default. The version number is hardcoded in configs.py and manually incremented whenever a change is made in the EXparser code that renders the model data backward-incompatible with previous code versions.
  • Since different models will have different training material (the whole point of having separate models), EXparser/Dataset needs to be renamed to EXparser/Datasets/default. The training material folders do not need to be versioned.
  • A new command docker run ... excite_toolchain create_model <model_id> is added which creates a directory EXparser/Models/<version>/<model_id> and copies over the non-reproducible model files (if there are any left). It returns a message saying that the user needs to add training material to EXparser/Datasets/<model_id> and to run training.
  • The model is selected when running the docker commands, such as docker run ... excite_toolchain exparser <model_id>. If no model name is supplied, "default" is used as the model id. If the model id does not exist, an error is raised saying that the command create_model must be run first.
  • docker run ... excite_toolchain (segmentation|extraction)_training <model_id> computes the models from the training material in EXparser/Datasets/<model_id> and places them into EXparser/Models/<version>/<model_id>.

Once this system is in place, an optional storage system can be built upon it. It works with packages that are a ZIP of the training material and model data, stored in a configurable location.

  • A new command docker run ... excite_toolchain download_model <model_id> is added which tries to download /excite-docker/<version>/<model_id>.zip from a WebDAV server (URL and credentials are supplied as environment variables). If that is successful, the ZIP is extracted and placed into the training and model directories corresponding to the version and model id.
  • A new command docker run ... excite_toolchain upload_model <model_id> is added, which uploads the training and model data as a ZIP to the WebDAV folder
  • A new command docker run ... excite_toolchain list_models is added, which returns a list of models stored at the given repository compatible with the current version

Tag-Fragments during Segmentation

Dear contributors, I encountered two cases of tag-fragments during segmentation that I could not edit, delete or interact with in any way.

  1. After running the Auto-segmentation, the function tends to leave a fragment of the Last Page tag when the pages in the reference are given in the format “S. Start Page – ”. The fragment occurs after the hyphen.

  2. Sometimes when using the Correct selected text function to add new characters while part of the selected text is already tagged, a fragment of the included tag is left next to the newly added characters. I noticed this multiple times but unfortunately wasn't able to find a reliable way to reproduce it.

You can find examples of these two cases in the screenshots I added.

(Screenshots: Tag-FragmentLastPage, Tag-FragmentTextEdit)

Ignore files starting with "."

Layout analysis and exparser should ignore all files starting with a dot (".") so that .gitkeep and other git files won't be analyzed
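The proposed filter can be sketched as follows (illustrative only; the function name and signature are hypothetical, not part of the codebase):

```python
import os

def input_files(directory):
    """List input files for layout analysis / exparser while skipping
    dotfiles such as .gitkeep, as proposed in this issue."""
    return sorted(f for f in os.listdir(directory) if not f.startswith("."))
```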

IndexError: string index out of range during segmentation

 File "/app/run-main.py", line 174, in <module>
    call_segmentation_training(sys.argv[2])
  File "/app/run-main.py", line 125, in call_segmentation_training
    os.path.join(model_dir, get_version(), model_name))
  File "/app/EXparser/Training_Seg.py", line 55, in train_segmentation
    train_feat[len(train_feat) - 1].extend([word2feat(a, stopw, 2, len(ln), b1, b2, b3, b4, b5, b6)])
  File "/app/EXparser/src/gle_fun_seg.py", line 378, in word2feat
    feat.update(get_last(w))
  File "/app/EXparser/src/gle_fun_seg.py", line 281, in get_last
    c = w[-1] * 2
IndexError: string index out of range
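The failure is in the expression c = w[-1] * 2, which raises IndexError when the token w is an empty string. A guarded version, as an illustrative fix sketch (the helper name is hypothetical):

```python
def last_char_doubled(w):
    """Return the token's last character doubled, as in gle_fun_seg.py's
    get_last(), but return an empty string for empty tokens instead of
    raising IndexError."""
    return w[-1] * 2 if w else ""
```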
