aksw / semann

Semantic Annotation Tool for PDF documents

CSS 1.01% Objective-C++ 1.39% JavaScript 12.64% PHP 0.62% HTML 41.16% Java 43.18%

semann's Introduction

Semantic Annotation Tool for PDF documents

SemAnn is a web-based semantic annotation tool for PDF documents.
SemAnn allows you to semantically annotate (using RDF triples) text in PDFs. These annotations are then used for recommending similar PDF documents that the reader might find relevant.

Project Current state

This project is a prototype.

Currently working features:

Load and render a PDF file in one half of the page, with a custom GUI in the other half.
Add annotations
View the available annotations of the currently loaded document
Find similar publications

Work in progress:

The following features are currently under development on this branch:

Annotating tables via datacube vocabulary.
Find similar publications functionality.

Documentation

Used libraries

PDF.js - Viewer Example is used as a base for the project
Twitter bootstrap - used for UI
jQuery - used for DOM manipulations, required by Twitter bootstrap
Typeahead.js - used for autosuggestion in input boxes
Rangy - A cross-browser JavaScript range and selection library.
DBpedia Lookup - looks up DBpedia URIs by related keywords.

Backend Database Used

Virtuoso

semann's People

jaanatak matanox alexgarciac lomascolo safoine27

semann's Issues

Consider JSON-LD

When I read in README.md that annotations are currently stored as JSON, I recalled that there is a standardised JSON notation for RDF graphs called JSON-LD (homepage, full specification).

For now I think it's a priority to use a triple store (#1), so this issue is definitely a low-priority one, but if JSON should ever play a role in future, I'd consider using JSON-LD, as it covers the full RDF data model (so that annotations can become arbitrarily complex).
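As a rough illustration of what this could look like, the sketch below writes one simple SemAnn-style annotation as a JSON-LD object. The URIs are taken from the RDF/XML example in the "different annotation views" issue on this page; the use of rdfs:label for the label is an assumption, since that example only shows the prefix "ns3".

```javascript
// Sketch only: one annotation expressed as JSON-LD.
// Assumption: the label property is rdfs:label (the RDF/XML example only shows "ns3:label").
const annotation = {
  "@context": {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#"
  },
  "@id": "http://eis.iai.uni-bonn.de/semann/pdf/what%20are%20semantic%20annotations.pdf#page=1?char=9,29&id=0/0/1/1:9,0/0/1/1:29",
  "@type": [
    "http://eis.iai.uni-bonn.de/semann/0.2/owl#Annotation",
    "http://dbpedia.org/resource/SAWSDL"
  ],
  "rdfs:label": { "@value": "Semantic Annotations", "@language": "en" }
};

// JSON-LD is plain JSON, so it can be serialized and stored like any other JSON value.
const serialized = JSON.stringify(annotation, null, 2);
console.log(serialized);
```

Because the result is still ordinary JSON, existing code that reads or writes JSON annotations would not need a different parser, only an agreed context.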

Text selection precision issues

Evaluation participants reported that text selection in PDF.js, the library the prototype uses for displaying PDFs, was not very precise: sometimes a letter was left out when double-clicking a word to select it, and selecting multiple words selected the whole paragraph unless you took extra care to avoid it.

I have also noticed that sometimes text in the selection contains whitespace in the middle of the word, which results in no matches found from the DBpedia Lookup service.
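One possible mitigation, independent of the PDF.js version, is to try more than one keyword per selection. The sketch below is an assumption about how that could be done, not existing SemAnn code: lookupCandidates is a hypothetical helper that returns the selection with whitespace collapsed, plus a fallback with all whitespace removed for words that were split by a stray space.

```javascript
// Sketch only: build lookup keywords from a raw PDF text selection.
// lookupCandidates is a hypothetical helper, not part of SemAnn or PDF.js.
function lookupCandidates(selectedText) {
  // Collapse runs of whitespace (including line breaks) and trim the ends.
  const collapsed = selectedText.replace(/\s+/g, " ").trim();
  // Fallback for a single word that was split by stray whitespace.
  const joined = collapsed.replace(/ /g, "");
  // Avoid returning the same keyword twice for unbroken single words.
  return collapsed === joined ? [collapsed] : [collapsed, joined];
}

console.log(lookupCandidates("  Seman tic  "));
```

The second candidate is only useful for single words; for genuine multi-word selections the first candidate is the one to send to the lookup service.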

It might be worth looking into whether a newer version of PDF.js library might solve these issues or potentially even replacing the library with something better if needed.

Highlight annotation snippets in PDF

Created annotation snippets should be highlighted in the PDF display. The main problem is that while window.getSelection() returns the proper text, PDF.js renders the text line by line in separate spans. We need to figure out the best way to highlight the text after it has been selected. It shouldn't be too hard, but it's slightly tricky.

different annotation views for novice and experienced users

Those evaluation participants who were unfamiliar with RDF triples and the semantic web were sometimes puzzled by the presence of three input fields (“subject”, “property”, “object”) when annotating. It has been suggested to hide the last two from the main annotation panel until the user expresses the will to insert complicated annotations.

The above request might make sense due to the fact that the main use case is likely to be just adding simple annotations (annotations without relations) rather than complex ones (annotations with relations).

example of simple annotation:

  <rdf:Description rdf:about="http://eis.iai.uni-bonn.de/semann/pdf/what%20are%20semantic%20annotations.pdf#page=1?char=9,29&amp;id=0/0/1/1:9,0/0/1/1:29">
    <rdf:type rdf:resource="http://eis.iai.uni-bonn.de/semann/0.2/owl#Annotation" />
    <rdf:type rdf:resource="http://dbpedia.org/resource/SAWSDL" />
    <ns3:label xml:lang="en">Semantic Annotations</ns3:label>
  </rdf:Description>

example of a complex annotation:

  <rdf:Description rdf:about="http://eis.iai.uni-bonn.de/semann/pdf/Microsoft%20Word%20-%20LinkedDataOnTheWeb2008-WorkshopSummary-Proceedings.doc%20-%20Linked%20Data%20on%20the%20Web.pdf#page=1?char=127,142&amp;id=0/3/1/1:0,0/3/1/1:15">
    <rdf:type rdf:resource="http://eis.iai.uni-bonn.de/semann/0.2/sdeo#Author" />
    <rdf:type rdf:resource="http://eis.iai.uni-bonn.de/semann/0.2/owl#Annotation" />
    <ns2:hasEmail rdf:resource="http://eis.iai.uni-bonn.de/semann/pdf/Microsoft%20Word%20-%20LinkedDataOnTheWeb2008-WorkshopSummary-Proceedings.doc%20-%20Linked%20Data%20on%20the%20Web.pdf#page=1?char=176,190&amp;id=0/6/1/1:0,0/6/1/1:14" />
    <ns4:label xml:lang="en">Christian Bizer</ns4:label>
  </rdf:Description>
  <rdf:Description rdf:about="http://eis.iai.uni-bonn.de/semann/publication/hasEmail">
    <rdf:type rdf:resource="http://eis.iai.uni-bonn.de/semann/0.2/owl#isAnnotationProperty" />
    <ns4:label xml:lang="en">has email</ns4:label>
  </rdf:Description>
  <rdf:Description rdf:about="http://eis.iai.uni-bonn.de/semann/pdf/Microsoft%20Word%20-%20LinkedDataOnTheWeb2008-WorkshopSummary-Proceedings.doc%20-%20Linked%20Data%20on%20the%20Web.pdf#page=1?char=176,190&amp;id=0/6/1/1:0,0/6/1/1:14">
    <rdf:type rdf:resource="http://dbpedia.org/resource/Email" />
    <rdf:type rdf:resource="http://eis.iai.uni-bonn.de/semann/0.2/owl#Annotation" />
    <ns4:label xml:lang="en">[email protected]</ns4:label>
  </rdf:Description>

further enhancement of resource suggestions

Currently, when a user selects some text, DBpedia Lookup API tries to find a matching resource to offer as a suggestion that the user can use in order to say that the annotation is an instance of that resource.

This capability does not currently extend beyond DBpedia resource matches, but it would be nice if similar functionality could be applied to loaded ontologies. This would reduce the need for the user to be familiar with the loaded ontology in order to search for the right class. Instead of searching manually, the user would get immediate suggestions to select from.

How this could be done exactly needs further research, e.g. comparing selected text to the class labels of loaded ontologies and offering potential matches as suggestions.
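A minimal sketch of the label-comparison idea follows. It assumes the loaded ontologies are available as an array of { uri, label } objects, which is a hypothetical shape chosen for illustration; the labels and the second class URI in the sample data are made up.

```javascript
// Sketch only: suggest ontology classes whose label matches the selected text.
// The { uri, label } shape and the sample labels are assumptions for illustration.
function suggestClasses(selectedText, classes) {
  const needle = selectedText.trim().toLowerCase();
  if (needle === "") {
    return [];
  }
  return classes.filter(function (cls) {
    const label = cls.label.toLowerCase();
    // Match in both directions so "authors" still finds the label "Author".
    return label.indexOf(needle) !== -1 || needle.indexOf(label) !== -1;
  });
}

const loadedClasses = [
  { uri: "http://eis.iai.uni-bonn.de/semann/0.2/sdeo#Author", label: "Author" },
  { uri: "http://example.org/onto#Table", label: "Table" } // hypothetical class
];

console.log(suggestClasses("authors", loadedClasses));
```

Plain substring matching is the simplest option; stemming or fuzzy matching would be possible refinements if it turns out to be too strict or too loose.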

error "Discontiguous selection is not supported."

When the page loads, you will see the following rangy-core.js error in the console:
"Discontiguous selection is not supported."

It's just a console warning that can't be suppressed. It happens during initialization, when the Rangy library tests whether the browser supports multiple ranges in a selection. It's annoying and there is a bug report on it: https://code.google.com/p/chromium/issues/detail?id=353069#c4. However, it causes no actual harm, so it can be ignored.

Integration with Annotopia

I would strongly suggest extending this tool to support communication with the Annotopia Open Annotation Server (https://github.com/Annotopia), an open universal hub for storing and publishing annotations in the Open Annotation ontology. This means that, in the true spirit of Linked Data, semantic annotations created with the SemAnn tool can be used by other tools like the Utopia PDF viewer once they are uploaded to the Annotopia server. Likewise, the SemAnn tool can take advantage of the data on the Annotopia server. Such integration would considerably increase the visibility of the semantically annotated data produced by the tool, making it available in the standard OA format. This would be a considerable step closer to what semantic publishing is about and would open up the data to everybody, not only the scientific community.

A good overview of the tool's capabilities is here: https://www.youtube.com/watch?v=UGvUbFv0Zl8
And there will be a paper out on it soon. PS. new version of Domeo (similar tool to SemAnn but for HTML) should become available around January 2015 and this will also run on that server.

Typeahead to show all values on request

Current input fields with typeahead support display matching values once you start typing. It would be nice to show all available values as well on request.

The subject is discussed in

http://stackoverflow.com/questions/12827483/bootstrap-show-all-typeahead-items-on-focus
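One way to approach this is a suggestion source that returns every value when the query is empty. The sketch below only shows that source function; makeSource is a hypothetical helper, and wiring it up depends on the typeahead version in use (the input would also have to allow a minimum query length of zero, which is an assumption to check against that version's documentation).

```javascript
// Sketch only: a typeahead-style source that returns all values for an empty query.
// makeSource is a hypothetical helper; how it is attached depends on the typeahead version.
function makeSource(values) {
  return function source(query, callback) {
    const q = query.trim().toLowerCase();
    const matches = q === ""
      ? values.slice() // nothing typed yet: offer everything
      : values.filter(function (value) {
          return value.toLowerCase().indexOf(q) !== -1;
        });
    callback(matches);
  };
}

const propertySource = makeSource(["hasEmail", "hasAuthor", "label"]); // sample values
propertySource("", function (all) { console.log(all); });
propertySource("has", function (some) { console.log(some); });
```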

support for deleting annotations

Add support for deleting annotations from the database; currently you can only insert new ones. This would be useful for fixing incorrect annotations.
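With a SPARQL-capable backend such as Virtuoso, deletion could be expressed as a SPARQL 1.1 Update request. The sketch below only builds the request text; buildDeleteQuery is a hypothetical helper, and the named graph URI in the example is made up.

```javascript
// Sketch only: build a SPARQL 1.1 Update request that removes every triple
// whose subject is the given annotation URI.
// Assumption: annotations are stored in a named graph; the graph URI used below is hypothetical.
function buildDeleteQuery(graphUri, annotationUri) {
  [graphUri, annotationUri].forEach(function (uri) {
    // Reject characters that would break out of the <...> IRI syntax.
    if (/[<>"\s]/.test(uri)) {
      throw new Error("Invalid URI: " + uri);
    }
  });
  return "DELETE WHERE { GRAPH <" + graphUri + "> { <" + annotationUri + "> ?p ?o } }";
}

console.log(buildDeleteQuery(
  "http://example.org/semann/annotations",
  "http://eis.iai.uni-bonn.de/semann/pdf/mixTable.pdf#page=1?char=13,25"
));
```

This removes only the triples in which the annotation is the subject. Triples in other annotations that point to it as an object (for example via a relation) would need a second pattern.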

DBpedia Lookup matches - replace the use of links for making a selection

Replace the use of links for making a selection (example use case: selecting a DBpedia match from the list of suggestions displayed after selecting text). This is counterintuitive as the user expects it to act as a link.

allow URI input

Currently the input fields accept one of the following:

free text
a selection from the offered options

It would make sense to add support for URIs. Why?

Then you can create relations between annotations from different papers (a very interesting use case)
You can use an ontology resource without necessarily loading it.
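A minimal sketch of how an input value could be told apart as a URI or as plain text is shown below. classifyInput is a hypothetical helper, and restricting URIs to http and https is an assumption made to keep the check simple.

```javascript
// Sketch only: decide whether an input field value is a URI or plain text.
// classifyInput is a hypothetical helper; only http(s) URIs are accepted here by assumption.
function classifyInput(value) {
  const text = value.trim();
  try {
    const url = new URL(text);
    if (url.protocol === "http:" || url.protocol === "https:") {
      return { kind: "uri", value: text };
    }
  } catch (e) {
    // Not parseable as an absolute URL: fall through and treat it as text.
  }
  return { kind: "text", value: text };
}

console.log(classifyInput("http://dbpedia.org/resource/Email"));
console.log(classifyInput("Christian Bizer"));
```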

Rangy: DOM modifications by highlighting upset range offsets

toggleRange(Range) in highlight.js applies highlights one range at a time. This means that by the time the next highlight is applied, the DOM has already changed and the range might have become invalid.

Example
Let us have the following DIV element which is identified by rangy as node "0/60/1/1"

<div>Bibliometric studies are a third group of publications. They</div>

We have 2 valid rangy ranges that we want to highlight:
Range A: 0/60/1/1:0,0/60/1/1:20
Range B: 0/60/1/1:33,0/60/1/1:54

We apply the first range:

toggleRange(Range A); // &rangyFragment=0/60/1/1:0,0/60/1/1:20

<div><span class="highlight">Bibliometric studies</span> are a third group of publications. They</div>

But when we want to apply the second range we get an error "Range is no longer valid after DOM mutation"

toggleRange(Range B); // &rangyFragment=0/60/1/1:33,0/60/1/1:54

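This is because Range B no longer starts at offset 33: once Range A (20 characters) has been wrapped in a span, the text that follows it sits in a new text node, where Range B starts at offset 13. The serialized range above has therefore become invalid.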
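One possible workaround, sketched below under stated assumptions, is to apply ranges within the same node from the highest start offset to the lowest, so that the text in front of each remaining range is still untouched when it is applied. parseFragment and orderForHighlighting are hypothetical helpers, the sketch only covers ranges that start and end in the same node, and the effect on the DOM is simulated on a plain string rather than performed through Rangy.

```javascript
// Sketch only: order serialized ranges so that highlighting one does not
// shift the offsets of those still to be applied. Hypothetical helpers.
function parseFragment(fragment) {
  // "0/60/1/1:33,0/60/1/1:54" -> node path and offsets of start and end
  const parts = fragment.split(",");
  const start = parts[0].split(":");
  const end = parts[1].split(":");
  return {
    fragment: fragment,
    startNode: start[0],
    startOffset: parseInt(start[1], 10),
    endNode: end[0],
    endOffset: parseInt(end[1], 10)
  };
}

function orderForHighlighting(fragments) {
  return fragments.map(parseFragment).sort(function (a, b) {
    if (a.startNode !== b.startNode) {
      return a.startNode < b.startNode ? -1 : 1; // keep ranges of one node together
    }
    return b.startOffset - a.startOffset; // later offsets first
  });
}

// Simulation on the text of the example DIV.
const text = "Bibliometric studies are a third group of publications. They";
let remaining = text; // text still held by the original (first) text node
const highlighted = [];
orderForHighlighting(["0/60/1/1:0,0/60/1/1:20", "0/60/1/1:33,0/60/1/1:54"])
  .forEach(function (range) {
    highlighted.push(remaining.slice(range.startOffset, range.endOffset));
    // Wrapping a range splits the node; the first node keeps only the text before it.
    remaining = remaining.slice(0, range.startOffset);
  });
console.log(highlighted);
```

Whether Rangy accepts the earlier range unchanged after the later one has been wrapped would still need to be verified in the browser.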

highlights get corrupted due to repetitive calls of highlight.init()

Problem 1
highlight.init() call is inserted to viewer.js and triggers on "pagechange" custom event of PDF.js. This is the wrong event to call it in because it triggers whenever a page is moved within the viewer. This results in hundreds of initialisations in a very short time. highlight.init() should only be called once when a pdf file is loaded, otherwise new rangy objects get created and handles to old ones get lost, resulting in loss of functionality.

E.g. methods like cssApplier.undoToRanges(ranges) currently have no effect.

The proper way to do this would be to call highlight.init() when the "documentload" custom event of PDF.js triggers. This only triggers when a pdf is opened.
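Independently of which event is used, the initialisation could also be guarded so that repeated triggers for the same document are harmless. The sketch below is an assumption about how such a guard might look: initPerDocument is a hypothetical helper, the document identifier (for example the file URL) is assumed to be available to the event handler, and the highlight object is a stand-in for the real module.

```javascript
// Sketch only: run an init function at most once per opened document.
// initPerDocument and the document identifier are hypothetical.
function initPerDocument(init) {
  let lastDocumentId = null;
  return function (documentId) {
    if (documentId === lastDocumentId) {
      return false; // already initialised for this document
    }
    lastDocumentId = documentId;
    init();
    return true;
  };
}

// Stand-in for the real highlight module, used only to make the sketch runnable.
const highlight = {
  initCount: 0,
  init: function () { this.initCount += 1; }
};

const ensureHighlightInit = initPerDocument(function () { highlight.init(); });

// Repeated triggers for the same file initialise only once; a new file initialises again.
ensureHighlightInit("a.pdf");
ensureHighlightInit("a.pdf");
ensureHighlightInit("a.pdf");
ensureHighlightInit("b.pdf");
console.log(highlight.initCount);
```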

Problem 2
We should avoid changing anything within the PDF.js library so as to not cause issues when we want to update the library in the future. It would be a good idea to move the changes out from viewer.js and make use of custom events of PDF.js instead.

"Fetch annotations" btn: Highlighting breaks with the use of PDFFindBar.searchAndHighlight(subject);

PDF.js's native highlighting, PDFFindBar.searchAndHighlight(), should not be used in our tool because it makes unexpected changes to the DOM that interfere with the rangy library's highlights. The current behaviour is also incorrect: the aim is not to highlight all occurrences of a term but to highlight the specific annotation originally made by the user. This can only be done via rangy's deserialization method.

This problem occurs when clicking on "Fetch annotations" button.
To recreate: open pdf -> "Fetch annotations" button -> click on some term among the triples in the table -> term occurrence is highlighted with PDFFindBar.searchAndHighlight -> "Fetch annotations" button -> highlighting fails and the following error can be observed in the console by rangy-core.js:
"Uncaught Error: Range error: Range is no longer valid after DOM mutation"

'char' parameter of an annotation displays incorrect start and end positions

Annotation URIs currently display unreliable information in the "char" parameter, eg.:
http://eis.iai.uni-bonn.de/semann/pdf/mixTable.pdf#page=1?char=13,25;length=12,UTF-8&rangyPage=1&rangyFragment=0/0/1/1:1,0/0/1/1:13

Known issues with the "char":

The current implementation, scientificAnnotation.getPreviousPagesCharacterCount(), cannot calculate the text length of unloaded pages. The pages do get loaded, but too late.
Relying on var currentPage = $('#pageNumber').val(); can be misleading when the viewing container is between two pages. One should derive the page from rangyPage or query the DOM for the active selection.
Overall, the current implementation is a bit of a black box due to the lack of comments.

Another question is how important is this implementation as we do not use this parameter in the code currently. In any case, an alternative is rangyFragment, which might prove to be more reliable and easy to use.
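To illustrate how rangyPage and rangyFragment could be read back from an annotation URI instead of relying on "char", here is a small parsing sketch. parseAnnotationUri is a hypothetical helper, and the separators it splits on (?, & and ;) are inferred from the example URI above rather than from a specification.

```javascript
// Sketch only: split an annotation URI into its document part and parameters.
// parseAnnotationUri is hypothetical; the separators are inferred from the example URI.
function parseAnnotationUri(uri) {
  const hashIndex = uri.indexOf("#");
  const result = {
    document: hashIndex === -1 ? uri : uri.slice(0, hashIndex),
    params: {}
  };
  if (hashIndex === -1) {
    return result;
  }
  uri.slice(hashIndex + 1).split(/[?&;]/).forEach(function (pair) {
    const eq = pair.indexOf("=");
    if (eq > 0) {
      result.params[pair.slice(0, eq)] = pair.slice(eq + 1);
    }
  });
  return result;
}

const parsedUri = parseAnnotationUri(
  "http://eis.iai.uni-bonn.de/semann/pdf/mixTable.pdf#page=1?char=13,25;length=12,UTF-8&rangyPage=1&rangyFragment=0/0/1/1:1,0/0/1/1:13"
);
console.log(parsedUri.params.rangyPage, parsedUri.params.rangyFragment);
```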

The following code logs the cumulative text length of the pages up to the current one, which might prove useful in solving this. It is adapted from an example.

    // Logs the cumulative text length of pages 1..currentPage.
    // Pages resolve asynchronously, so the order in which their text is
    // appended to str is not guaranteed.
    var currentPage = parseInt($('#pageNumber').val(), 10);
    var str = "";

    // Appends the text of one page to str and logs the running total.
    var processPageText = function (text) {
        // bidiTexts has a property identifying whether this
        // text is left-to-right or right-to-left
        for (var i = 0; i < text.bidiTexts.length; i++) {
            str += text.bidiTexts[i].str;
        }
        // later this will insert into an index
        console.log("\nCumulative page length = " + str.length);
    };

    // Requests the text content of a page once the page itself is available.
    var processPage = function (pageData) {
        pageData.getTextContent().then(processPageText);
    };

    for (var j = 1; j <= currentPage; j++) {
        PDFView.getPage(j).then(processPage);
    }

Images for Wiki

The whole point of this issue is to get around github's limitation to uploading images to its wiki. This is a workaround.

Storing and loading of triples

Triples need to be saved to and loaded from some sort of server. The original idea was to make a server.T.js provider, where T can be any custom backend the user implements. By default we can provide something like server.sparql.js (uses SPARQL queries to load and save data) and server.ontowiki.js (uses OntoWiki to load and save data).
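The provider idea could be sketched roughly as follows for the SPARQL case. Everything here is an assumption for illustration: createSparqlProvider, its method names and the named graph URI are hypothetical, and the sketch only builds query text, leaving the actual HTTP request to the backend-specific code.

```javascript
// Sketch only: a possible shape for a server.sparql.js style provider.
// All names and the graph URI are hypothetical; only URI-valued triples are handled.
function createSparqlProvider(graphUri) {
  return {
    // Update request that stores one triple.
    buildSaveQuery: function (triple) {
      return "INSERT DATA { GRAPH <" + graphUri + "> { <" + triple.subject + "> <" +
        triple.predicate + "> <" + triple.object + "> . } }";
    },
    // Query that loads all triples whose subject URI starts with the document URI.
    buildLoadQuery: function (documentUri) {
      return "SELECT ?s ?p ?o WHERE { GRAPH <" + graphUri + "> { ?s ?p ?o . " +
        "FILTER(STRSTARTS(STR(?s), \"" + documentUri + "\")) } }";
    }
  };
}

const provider = createSparqlProvider("http://example.org/semann/annotations");
console.log(provider.buildSaveQuery({
  subject: "http://eis.iai.uni-bonn.de/semann/pdf/mixTable.pdf#page=1?char=13,25",
  predicate: "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
  object: "http://dbpedia.org/resource/Email"
}));
console.log(provider.buildLoadQuery("http://eis.iai.uni-bonn.de/semann/pdf/mixTable.pdf"));
```

A server.ontowiki.js provider would expose the same two operations, so the annotation code would not need to know which backend is in use.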

UI improvements

Better UI for creating and managing annotations. It should be intuitive enough not to require the user to be familiar with the technology behind it (ontologies, RDF triples, etc.).
This also includes running suggestions for property names and turning literals into classes via a prompt such as "what did you mean by ___?".

error "Failed to load resource: file://fonts.googleapis.com/css?family=Droid+Sans:400,700"

When the page loads, you will see the following jquery.min.js error in the console:
"Failed to load resource: file://fonts.googleapis.com/css?family=Droid+Sans:400,700"

This error is a major source of delay in page load time, causing a 22 sec delay (Chrome Version 37.0.2062.120 m), 93% of the overall load time!

Downloading a newer version of jquery should fix this.

Rank recommendations based on contextual information higher

… by extending the SPARQL query for the recommendations.

Opening a file by selecting it from the "Find Similar" suggestions.

Selecting a file from the list of similar publications does not currently open the file in the browser. This will be fixed when a demo site can be hosted somewhere.
Open question: where to store the PDF files (locally or on the server).

Move to a hosted site

Set up a demo site for the tool
Move Virtuoso backend to the server

use frames

There are UI layout issues when the right side pane exceeds the height of the left pane. Reasons to adopt frames:

It would be visually more appealing if these panes were in separate frames.
It would also allow the user to adjust the width of the left side pane to their liking.
There is no downside to using frames in this context.

ontology selection improvements

Currently ontology selection has no effect; it should either remove or add values to the typeahead inputs.
Uploading an ontology replaces the typeahead values with those of the ontology last loaded; it should append to them instead.
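The append behaviour could be as simple as merging the new values into the existing list while skipping duplicates. In the sketch below, appendTypeaheadValues is a hypothetical helper and the value names are sample data.

```javascript
// Sketch only: append newly loaded ontology values to the existing typeahead values.
// appendTypeaheadValues is a hypothetical helper; the values are sample data.
function appendTypeaheadValues(existing, loaded) {
  // A Set keeps the first occurrence of each value and preserves insertion order.
  return Array.from(new Set(existing.concat(loaded)));
}

const currentValues = ["Author", "hasEmail"];
const newOntologyValues = ["hasEmail", "Table"];
console.log(appendTypeaheadValues(currentValues, newOntologyValues));
```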

different colour highlights for annotations from different ontologies

Consider using a different colour highlighting for annotations that are instances of the Semann Discourse Elements Ontology. In this way you can see better whether such annotations contain further child annotations that are instances of other ontologies. You could expand this to color coding all annotations according to what ontology they are an instance of.

This requires checking whether it is possible with the Rangy library at all, or whether Rangy merges overlapping highlights into a single highlight.

Find similar publications functionality

Implementation of "Find similar publications" by comparing current annotations to others in the triple store.

Highlighting breaks on pages that are not currently loaded

If you are on page 10 and press the "Fetch annotations" button, and it returns an annotation on page 1 (which is not currently loaded into the DOM by pdf.js), you get an error in the console during rangy deserialisation:
There was an error during highlighting. Potentially corrupted data in '0/167/1/1:10,0/167/1/1:30,0/92/1/1:0,0/92/1/1:8,0/167/1/1:33,0/169/1/1:3,0/167/1/1:33,0/169/1/1:3,0/299/1/15:8,0/299/1/15:18,0/299/1/15:8,0/299/1/15:18,0/30/1/1:0,0/34/1/1:19,0/30/1/1:0,0/34/1/1:19,0/299/1/15:8,0/299/1/15:18,0/16/1/1:0,0/16/1/1:13,0/18/1/1:0,0/18/1/1:9,0/18/1/1:0,0/18/1/1:9,0/166/1/1:0,0/167/1/1:5,0/287/1/15:4,0/287/1/15:13'.
Error in Rangy Serializer module: deserializePosition() failed: node [1][1][<canvas id="page1" width=] has no child with index 167, 1

The reason is that pages that are not loaded, look like this:

<div id="pageContainer1" class="page" style="width: 782px; height: 1013px;">
  <div class="loadingIcon"></div>
</div>

compared to pages that are loaded (and deserialisable):

<div id="pageContainer8" class="page" style="width: 782px; height: 1013px;" data-loaded="true">
  <div class="canvasWrapper" style="width: 782px; height: 1013px;">
    <canvas id="page8" width="782" height="1013" style="width: 782px; height: 1013px;"></canvas>
  </div>
  <div class="textLayer" style="width: 782px; height: 1013px;">
    ... page contents ...
  </div>
</div>

Some ideas for fixing this:

See if there is an option that could be set in pdf.js that forces the loading of all pages
Highlight only those annotations that correspond to currently loaded pages. You can get this information from calling PDFView.getVisiblePages().
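The second idea could be sketched as follows. splitByLoadedPages is a hypothetical helper, and the { page, fragment } shape of an annotation is an assumption; the list of loaded page numbers would come from PDFView.getVisiblePages() or from the data-loaded attribute shown above.

```javascript
// Sketch only: separate annotations that can be highlighted now from those
// whose page is not loaded yet. Helper name and annotation shape are hypothetical.
function splitByLoadedPages(annotations, loadedPages) {
  const loaded = new Set(loadedPages);
  const result = { ready: [], deferred: [] };
  annotations.forEach(function (annotation) {
    if (loaded.has(annotation.page)) {
      result.ready.push(annotation);
    } else {
      result.deferred.push(annotation);
    }
  });
  return result;
}

const fetched = [
  { page: 1, fragment: "0/167/1/1:10,0/167/1/1:30" },
  { page: 10, fragment: "0/92/1/1:0,0/92/1/1:8" }
];
const split = splitByLoadedPages(fetched, [9, 10, 11]);
console.log(split.ready.length, split.deferred.length);
```

The deferred annotations could then be highlighted later, once their page has been rendered.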
