PDFAnno

PDFAnno is a browser-based linguistic annotation tool for PDF documents.
It lets you annotate PDFs with labels and relations.
It is suited to building gold-standard data for natural language processing and machine learning, such as named entity spans, dependency relations, and coreference chains.

If you use PDFAnno, please cite the following paper:

Hiroyuki Shindo, Yohei Munesada and Yuji Matsumoto,
"PDFAnno: a Web-based Linguistic Annotation Tool for PDF Documents",
In Proceedings of LREC, 2018.

Using the latest version of Chrome is highly recommended. (Firefox will also be supported in the future.)

Installation

To install PDFAnno locally:

git clone https://github.com/paperai/pdfanno.git
cd pdfanno
npm install
cp .env.example .env

Then, edit .env as you like.
The default values are:

SERVER_PORT=1000

Run Server

npm run server

Usage

  1. Visit the online demo with the latest version of Chrome.
  2. Load your PDF and annotation file (if any). Sample PDFs and annotations are downloadable from here.
  3. Annotate the PDF as you like.
  4. Save your annotations using the save button.
    To continue annotating later, select your directory again with the Browse button to reload the PDF and anno file.

For security reasons, PDFAnno does NOT automatically save your annotations.
Don't forget to download your current annotations!

Annotation Tools

Tool: Description
Span: highlights a span of text. A span cannot cross a page boundary.
One-way relation: annotates a dependency relation between two spans.
Rectangle: marks a rectangular region. A rectangle cannot cross a page boundary.

Annotation File (.anno)

In PDFAnno, an annotation file (.anno) follows the TOML format.
Here is an example anno file:

pdfanno = "0.4.1"
pdfextract = "0.2.4"

[[spans]]
id = "1"
page = 1
label = "label1"
text = "AgBi 0.05 Sb 0.95 Te 2"
textrange = [1422,1438]

[[spans]]
id = "2"
page = 1
label = "label1"
text = "0.48 Wm [NO_UNICODE] 1 K [NO_UNICODE] 1 )"
textrange = [1386,1397]

[[relations]]
head = "1"
tail = "2"
label = "relation1"

Here, textrange gives the start and end token IDs of the span in pdftxt.
pdftxt is a text file extracted from the original PDF file.
You can download pdftxt with the pdf.txt button at the top right of the screen.
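
Because an anno file is plain TOML, it can be read outside PDFAnno with any TOML parser. The following is a minimal sketch in Python, not part of PDFAnno itself. It assumes Python 3.11+ (for the standard-library tomllib module, with the third-party tomli package as a fallback), and the helper names load_anno and describe_relations are hypothetical. It parses the example above and resolves each relation's head and tail ids to the text of the spans they refer to.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:  # older Python: requires `pip install tomli`
    import tomli as tomllib

# The example anno file shown above, embedded so the sketch is self-contained.
ANNO_TEXT = '''
pdfanno = "0.4.1"
pdfextract = "0.2.4"

[[spans]]
id = "1"
page = 1
label = "label1"
text = "AgBi 0.05 Sb 0.95 Te 2"
textrange = [1422,1438]

[[spans]]
id = "2"
page = 1
label = "label1"
text = "0.48 Wm [NO_UNICODE] 1 K [NO_UNICODE] 1 )"
textrange = [1386,1397]

[[relations]]
head = "1"
tail = "2"
label = "relation1"
'''


def load_anno(text):
    """Parse anno-file text; return (spans indexed by id, list of relations)."""
    data = tomllib.loads(text)
    spans = {span["id"]: span for span in data.get("spans", [])}
    relations = data.get("relations", [])
    return spans, relations


def describe_relations(spans, relations):
    """Return (head text, relation label, tail text) for each relation."""
    return [
        (spans[rel["head"]]["text"], rel["label"], spans[rel["tail"]]["text"])
        for rel in relations
    ]


spans, relations = load_anno(ANNO_TEXT)
for head_text, label, tail_text in describe_relations(spans, relations):
    print(f"{head_text!r} --{label}--> {tail_text!r}")
```

To read a file from disk instead, open it in binary mode and pass the file object to tomllib.load.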

Reference Anno File

To support multi-user annotation, PDFAnno can load reference anno files.
For example, if you create a.anno and another annotator creates b.anno for the same PDF, load a.anno as usual and load b.anno as a reference file. PDFAnno then renders a.anno and b.anno in different colors. Rendering more than one reference file is also supported.
This is useful for checking inter-annotator agreement and resolving annotation conflicts.
Note that reference files are rendered as read-only.
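
Alongside the visual comparison, agreement between two annotators can also be estimated offline from their anno files. The sketch below is an illustration under stated assumptions, not a PDFAnno feature. It treats two spans as matching when their page, textrange, and label are all equal, and it reports the Jaccard overlap of the two span sets as a simple agreement score. The function names and the sample data for a.anno and b.anno are hypothetical; each dictionary mirrors one [[spans]] table after TOML parsing.

```python
def span_key(span):
    """Identify a span by page, token range, and label (an assumed matching rule)."""
    return (span["page"], tuple(span["textrange"]), span["label"])


def compare_spans(spans_a, spans_b):
    """Compare two annotators' span lists.

    Returns the span keys found in both lists, in only one of them, and the
    Jaccard overlap (shared keys divided by all distinct keys).
    """
    keys_a = {span_key(s) for s in spans_a}
    keys_b = {span_key(s) for s in spans_b}
    both = keys_a & keys_b
    union = keys_a | keys_b
    agreement = len(both) / len(union) if union else 1.0
    return {
        "both": both,
        "only_a": keys_a - keys_b,
        "only_b": keys_b - keys_a,
        "agreement": agreement,
    }


# Hypothetical spans from a.anno and b.anno for the same PDF.
spans_a = [
    {"id": "1", "page": 1, "label": "label1", "textrange": [1422, 1438]},
    {"id": "2", "page": 1, "label": "label1", "textrange": [1386, 1397]},
]
spans_b = [
    {"id": "1", "page": 1, "label": "label1", "textrange": [1422, 1438]},
    {"id": "2", "page": 1, "label": "label2", "textrange": [1386, 1397]},
]

result = compare_spans(spans_a, spans_b)
print(f"agreement: {result['agreement']:.2f}")
print("conflicts in a.anno:", sorted(result["only_a"]))
print("conflicts in b.anno:", sorted(result["only_b"]))
```

Exact matching is strict: a span whose boundaries differ by one token counts as a disagreement. A more lenient rule, such as overlap of token ranges, may suit some projects better.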

Contact

Please contact hshindo or feel free to create an issue.

LICENSE

MIT
