marty-oehme / pubs-extract

Extract annotations from any PDF document. A plugin for the pubs bibliography manager. Mirror of https://git.martyoeh.me/Marty/pubs-extract

License: GNU Lesser General Public License v3.0

Python 100.00%
bibtex pdf pubs pubs-plugin

pubs-extract's Issues

Improve explanation of note/pdf-embedded text similarity

The explanation of the minimum_text_similarity configuration in particular should be improved. PDF annotations carry both a 'highlighted' text and an appended 'note' text, and anyone who does not know this will be lost in the similarity explanations and the different possible outcomes.

Perhaps a footnote and a little picture to help visualize it or something similar would suffice.

Allow page-sequential annotation extraction in notes

Currently, new annotations always get appended to the end of notes. They do so in the order that they appear in the PDF document (i.e. sequentially).

This option could allow annotations to be added in page order across the whole note, i.e. a new annotation on page 15 would be placed between the existing annotations for page 13 and page 18.

Requires some way of clearly delineating what is an annotation and what is manually added in a note.
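
A minimal sketch of the page-ordered insertion described above, assuming the existing annotations have already been separated from manually written text and parsed into records that carry a page number. The `Annotation` class and `insert_sequential` function are hypothetical names for illustration and not part of the plugin:

```python
from bisect import insort
from dataclasses import dataclass, field


@dataclass(order=True)
class Annotation:
    # Hypothetical record; only `page` takes part in ordering.
    page: int
    text: str = field(compare=False)


def insert_sequential(existing: list[Annotation], new: Annotation) -> None:
    """Insert `new` into `existing` (already sorted by page), keeping page order.

    Annotations on the same page are placed after the ones already present.
    """
    insort(existing, new)


notes = [Annotation(13, "existing note A"), Annotation(18, "existing note B")]
insert_sequential(notes, Annotation(15, "newly extracted note"))
print([a.page for a in notes])
```

The sketch sidesteps the open question of how to delineate annotations from manual text; it only shows the ordering step once that delineation exists.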

Allow custom colors for tags

Currently, we provide six colors that annotations can be recognized in:
red, green, blue, yellow, purple, orange.

Since we already map these color names to exact RGB tuples behind the scenes, we may as well let the user provide exact tuples to check for custom annotation colors.

A possible configuration file could then look like this:

[[[tags]]]
orange = "important"
blue = "todo"
mygreycolor = "0.5,0.5,0.5:unimportant"

in the form {customcolorname} = "{color_vector}:{tag_name}". This form would preserve the old settings structure but be a little awkward, since the color name given serves no purpose.

A more streamlined form would be:

[[[tags]]]
important = "orange"
todo = "blue"
unimportant = "0.5,0.5,0.5"

which swaps the tag and its mapped color, making it easy to map custom colors but breaking the old configuration format.
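
As a rough sketch of how values in the streamlined form might be resolved, assuming each value is either one of the six color names or a comma-separated triple of floats between 0 and 1. The `NAMED_COLORS` table and `parse_color` function are hypothetical, and the RGB values shown are placeholders rather than the tuples the plugin actually uses:

```python
# Assumed name -> RGB mapping; the plugin's real tuples may differ.
NAMED_COLORS = {
    "red": (1.0, 0.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
    "yellow": (1.0, 1.0, 0.0),
    "purple": (0.5, 0.0, 0.5),
    "orange": (1.0, 0.65, 0.0),
}


def parse_color(value: str) -> tuple[float, float, float]:
    """Resolve a config value to an RGB tuple.

    Accepts a known color name or a string such as "0.5,0.5,0.5".
    Raises ValueError for anything else.
    """
    value = value.strip().lower()
    if value in NAMED_COLORS:
        return NAMED_COLORS[value]
    parts = [float(p) for p in value.split(",")]  # float() raises ValueError on junk
    if len(parts) != 3 or not all(0.0 <= p <= 1.0 for p in parts):
        raise ValueError(f"not a color name or RGB triple: {value!r}")
    return (parts[0], parts[1], parts[2])


# The [[[tags]]] section from the example above, as a plain dict.
tags = {"important": "orange", "todo": "blue", "unimportant": "0.5,0.5,0.5"}
colors = {tag: parse_color(value) for tag, value in tags.items()}
```

Accepting both names and tuples in the same field keeps the common case short while still allowing exact custom colors.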

Improve annotation duplication detection in notes

Currently, the plugin does a 1:1 exact match for existing annotations in notes.

Since we already have a Levenshtein distance calculation, we could potentially use it for a more lenient annotation comparison (e.g. to tolerate fixed spelling mistakes, or added words that were missing in auto-extraction).

The hardest part would presumably be finding the range that should be compared to the extracted note. We cannot compare the whole document, so which existing annotations should we compare, and how do we delimit annotations?
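
A lenient comparison could look roughly like the sketch below. It assumes the existing annotations have already been split into a list of strings, so it does not address the delimiting question, and it uses its own small Levenshtein implementation rather than whatever the plugin uses internally. The function names and the 0.8 threshold are illustrative assumptions:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, computed row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to a score from 0.0 (unrelated) to 1.0 (identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_duplicate(new: str, existing: list[str], threshold: float = 0.8) -> bool:
    """True if `new` is at least `threshold` similar to any existing annotation."""
    return any(similarity(new, old) >= threshold for old in existing)
```

With this normalization, a corrected typo in a sentence-length annotation stays well above the threshold, while a genuinely different annotation falls below it.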
