The odinson from lum-ai

Increase test coverage to 60%

Sorting results by document id

A minor issue we've been dealing with is that it's useful for us to get results sorted by doc id (e.g., it lets us more quickly display "missing" extractions when working against a labeled dataset; it gives users a better feel for the frequency of matches per document; and it provides a stable order of results when new documents are added to the index).

We currently fetch all of Odinson's results and re-sort them, but it would be nice if Odinson supported an option to return results sorted by doc-id (and potentially support paging by doc-id, etc.).
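The client-side re-sort described here might look like the following minimal sketch. The `Hit` case class and its fields are hypothetical stand-ins for illustration, not Odinson's actual result type.

```scala
// Hypothetical stand-in for a search hit; not Odinson's actual result class.
final case class Hit(docId: String, sentenceIndex: Int, text: String)

object SortByDocId {
  // sortBy is a stable sort, so hits with equal keys keep their original relative order.
  def sortHits(hits: Seq[Hit]): Seq[Hit] =
    hits.sortBy(h => (h.docId, h.sentenceIndex))
}
```

Note that the ids are compared as strings here, so "doc10" sorts before "doc2"; a numeric key would be needed if the ids are numbers.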

Add info to docs

End-user documentation

We need to add (not necessarily in this order, though I did try to put them in a semblance of order):

Nice to have:

info on testing (@gcgbarbosa)

We also need to:

re-write the main landing README page to be simpler and point people to docs

others?

Proofreading/editing of current content SUPER appreciated

Greedy quantifiers

Currently Odinson quantifiers are neither greedy nor lazy. The greedy v. lazy distinction is, however, essential moving forward. Consider a naive rule fragment for detecting an NP:

[tag=/N.*/]+

The current behavior is to return all (or all tiled?) possible matches, rather than the longest. This clutters the state and is counter-intuitive to the user considering the syntax and standard meaning of the + quantifier.
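To make the difference concrete, here is a toy model in plain Scala, not Odinson's matcher. It assumes a sentence is just a sequence of POS tags and compares "every possible match" with a greedy (leftmost-longest) reading of [tag=/N.*/]+. The object and method names are made up for this sketch.

```scala
object QuantifierDemo {
  // Equivalent to a full match of the regex N.* against the tag.
  private def isNoun(tag: String): Boolean = tag.startsWith("N")

  // Every contiguous span [start, end) whose tags are all nouns.
  def allMatches(tags: IndexedSeq[String]): Seq[(Int, Int)] =
    for {
      start <- tags.indices
      end <- (start + 1) to tags.length
      if tags.slice(start, end).forall(isNoun)
    } yield (start, end)

  // Greedy reading: from each start, consume as many nouns as possible,
  // then resume scanning after the end of that match.
  def greedyMatches(tags: IndexedSeq[String]): Seq[(Int, Int)] = {
    val out = Seq.newBuilder[(Int, Int)]
    var i = 0
    while (i < tags.length) {
      if (isNoun(tags(i))) {
        var j = i
        while (j < tags.length && isNoun(tags(j))) j += 1
        out += ((i, j))
        i = j
      } else i += 1
    }
    out.result()
  }
}
```

For the tags DT NN NN VBZ, the first function yields three overlapping spans while the greedy one yields only the single two-token span.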

Code coverage integration

In order to know how well Odinson is covered by tests, both the language and the system, we need to integrate some form of code coverage report.

Options include:

the scoverage sbt plugin
codecov
others

We should consider pricing/long-term support when making decisions

Support queries over different datasets in Odinson's API

As developers of an application on top of Odinson it would be helpful to us if the dataset/index we query against could be determined dynamically, at query time.

One way to support this in Odinson is to add an optional "dataset name" parameter in the query API, and to require the user to specify the index path for each dataset in the config file before starting the API.

We think we might also get a similar effect by indexing multiple datasets into the same index and using the "parent query" feature in the API to restrict the results to a specific dataset. We haven't tried this direction though, and we're not sure if that's the intended use.

Search results order based on document id (not lucene id)

Is your feature request related to a problem? Please describe.
This is a continuation of #20, which changed the behavior of search to return the results in the order of the internal lucene id. We would like the order of search results to correspond to the defined external document ids and not the internal lucene id.

Describe the solution you'd like
I propose a simple workaround: load the documents into the index in the order of their document ids, so that the order induced by the lucene id matches the ordering of the document ids.
This should be controllable by configuration, as it will slow the indexing process and will not be needed in all scenarios.
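A minimal sketch of the proposed ordering step, under stated assumptions: `PendingDoc` is a hypothetical stand-in for a document waiting to be indexed, and `sortByDocId` is an assumed configuration flag, not an existing Odinson setting.

```scala
// Hypothetical stand-in for a document awaiting indexing.
final case class PendingDoc(id: String, path: String)

object IndexingOrder {
  // sortByDocId is an assumed configuration flag, not an existing Odinson setting.
  // When it is off, documents are indexed in the order they were discovered.
  def indexingOrder(docs: Seq[PendingDoc], sortByDocId: Boolean): Seq[PendingDoc] =
    if (sortByDocId) docs.sortBy(_.id) else docs
}
```

Sorting requires all document ids to be known before indexing starts, which is one reason it makes sense to keep it optional.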

Additional context
A PR is available here: #37

Add scala docs

We need scala docs to clearly document the functionality of each Odinson class.
These should be linked to the current docs.

Testing examples in the Docs

Testing

Now that the docs are live: http://gh.lum.ai/odinson/
We should make a unit test from each of the examples provided.
There are currently several examples, likely some more will be added, but there are enough to get started.

Thanks!

Increase test coverage to 65%

Increase test coverage to 65% (based on documented features)

Memory usage for annotation/indexer stage

It looks like the annotator in the indexer step can require a fair amount of memory, and occasionally crashes with out-of-memory errors, but the memory requirements aren't currently mentioned in the documentation.

export SBT_OPTS="-Xmx32g" seems to show about 22g of usage for 32 threads in a corpus of 40k documents between 200 bytes and ~2MB each. The error when it was crashing was very long (and across threads), so it was easy to miss the out of memory error buried in there -- it seemed to be throwing an error on a null, so I initially thought it was an issue with invalid characters or some other parse error on the dataset.

Seems to run much faster with more memory too.

move images to docs directory

ExtractorEngine.getTokens() should rely on config for determining raw field

Currently the ExtractorEngine's .getTokens() method assumes the doc's rawTokenField is "raw":

odinson/core/src/main/scala/ai/lum/odinson/ExtractorEngine.scala

Line 143 in 552b49c

TokenStreamUtils.getTokens(docID, "raw", indexSearcher, analyzer)

Everywhere else these fields are determined via the config or by passing config-derived fields to constructors.

odinson stuck on a query with an optional disjunctive component

Describe the bug
From the odinson shell:

If I query (?<subject> [entity="ORGANIZATION"]+) I get immediate results.
If I query (?<subject> [entity="ORGANIZATION"]+) (a | the) I get immediate results
If I query (?<subject> [entity="ORGANIZATION"]+) (a | the)? odinson is stuck indefinitely while consuming 100% CPU.

The problem doesn't occur on very small indices, but seems to reproduce on indices of a few thousand documents (we can share a sample corpus and index if needed).

I hit this issue on the current master as well as on older versions.

Increase test coverage to 70%

Increase test coverage to 70% (based on documented features)

Benchmark Odinson rules performance

Using a few rules (events) benchmark 100, 1k, 100k docs (run flight recorder on the code)

Meet with Marco/Keith for details this week and build indexes/indices.

Simplify search result display

Currently, full TAG displays are rendered for each result in a page. This is necessarily time-consuming to render.

The initial view of the search results should be in text, with an interface to expand the result into the TAG rendering. The matches should ideally still be indicated within the sentence, perhaps by highlighting.

Testing the ! operator

Testing

Test the ! operator for token constraints.

E.g., [tag=/N.*/ & !lemma=puppy]

also test -- [tag=/N.*/ & lemma!=puppy]

Annotation agnosticism

As a user with a custom index, I would like to include arbitrary token-level tags.

I think we need to adjust the code that interacts with this portion of the config:

odinson/core/src/main/resources/reference.conf

Lines 33 to 103 in 0646f4a

    
    compiler {
      # fields available per token
      allTokenFields = [
        ${odinson.index.rawTokenField},
        ${odinson.index.wordTokenField},
        ${odinson.index.normalizedTokenField},
        ${odinson.index.lemmaTokenField},
        ${odinson.index.posTagTokenField},
        ${odinson.index.chunkTokenField},
        ${odinson.index.entityTokenField},
        ${odinson.index.incomingTokenField},
        ${odinson.index.outgoingTokenField},
      ]

      # the token field to be used when none is specified
      defaultTokenField = ${odinson.index.normalizedTokenField}

      sentenceLengthField = ${odinson.index.sentenceLengthField}
      dependenciesField = ${odinson.index.dependenciesField}
      incomingTokenField = ${odinson.index.incomingTokenField}
      outgoingTokenField = ${odinson.index.outgoingTokenField}

      # if we are using the normalizedTokenField as the default
      # then we should casefold the queries to the default field
      # so that they match
      aggressiveNormalizationToDefaultField = true
    }

    index {
      # the raw token
      rawTokenField = raw

      # the word itself
      wordTokenField = word

      # a normalized version of the token
      normalizedTokenField = norm

      # the normalized field will include values from the following fields
      addToNormalizedField = [
        ${odinson.index.rawTokenField},
        ${odinson.index.wordTokenField},
      ]

      lemmaTokenField = lemma
      posTagTokenField = tag
      chunkTokenField = chunk
      entityTokenField = entity
      incomingTokenField = incoming
      outgoingTokenField = outgoing
      dependenciesField = dependencies
      documentIdField = docId
      sentenceIdField = sentId
      sentenceLengthField = numWords
      maxNumberOfTokensPerSentence = 100

Specifically the QueryCompiler... (e.g.,

odinson/core/src/main/scala/ai/lum/odinson/compiler/QueryCompiler.scala

Line 564 in 0646f4a

def apply(config: Config, vocabulary: Vocabulary): QueryCompiler = {

++)

Also -- while we're making changes, let's rename "dependenciesField" to "graphField" since we're trying to be all annotation agnostic and stuff. :)

Gracefully recover from bad queries

Illegal or non-functioning queries cause the UI to crash permanently. An error stack trace is displayed on the webpage, and submitting a working query will not cause the stack trace to go away, even when good results are returned by the REST API (as reflected in the npm run dev console output).

I would like graceful recovery from bad queries. Most importantly, entering a good query after a bad one should lead to results being displayed as usual. Ideally, the UI would point out the problem with the query. However, just a message briefly saying that the query contains an error would be an improvement.

Test single quote

Testing

While we allow either single quotes or double quotes to wrap strings with special stuff going on:
e.g., "3:10" to Yuma vs. '3:10' to Yuma

Behind the scenes, we're using Java escaping rules, which don't allow single quotes.
We need to see if the single quote is working, specifically with escaped characters inside a single quote.
e.g., 'abc"def' or something with a unicode escaped char.

@marcovzla please add examples/clarify the details of this issue. thanks!

dependency vocabulary first entry is corrupted and not searchable

Describe the bug
The dependencies.txt file written to the index contains some additional bytes at the beginning of the file, causing the first dependency label to not be searchable. This is hard to detect because it only affects a single dependency label, and the specific label will change between runs of indexing.

Reproducing this behavior

Go to the index directory, open the dependencies.txt file, and look at the dependency on the first line (it will be prefixed with some junk char/s). For example, let's say it was compound; then run the query: (?<p> []) <compound [], and you will get 0 results.
Manually delete the junk prefix char/s on the first line of dependencies.txt and run the query again (after restarting the backend). This time you should see results.

Investigating the reason for the bug
I believe this bug was introduced by: #30
From what I understand, this happens because the file is written with the FSDirectory API but read with the LumAI File Utils (which use regular streams from java.io).

Here is a code snippet to demonstrate this behavior:

  import java.io.{BufferedReader, FileInputStream, InputStreamReader}
  import java.nio.file.Paths
  import ai.lum.common.ConfigUtils._
  import com.typesafe.config.ConfigFactory
  import org.apache.lucene.store.{FSDirectory, IOContext}

  val config = ConfigFactory.load()
  val indexDir = config[String]("odinson.indexDir")

  // writing the data through FSDirectory API
  val directory = FSDirectory.open(Paths.get(indexDir))
  val streamOut = directory.createOutput("test.txt", new IOContext())
  streamOut.writeString("abcd")
  streamOut.close()

  // reading the data through regular java.io API
  val stream1In = new BufferedReader(
    new InputStreamReader(new FileInputStream(indexDir + "/test.txt"))
  )
  println("|" + stream1In.readLine() + "|")
  stream1In.close()

  // reading the data through FSDirectory API
  val stream2In = directory.openInput("test.txt", new IOContext())
  println("|" + stream2In.readString() + "|")
  stream2In.close()

The output is:

|�abcd|
|abcd|

On the first output line (reading using java.io), there is an extra character that causes the corruption.

The solution should be to use the same API for both reading and writing, either the FSDirectory one or the java.io one, without mixing them.
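The extra character is consistent with Lucene's documented string format, in which writeString first writes the byte length as a variable-length integer (VInt) and then the UTF-8 bytes. The sketch below is a plain-Scala emulation of that layout with no Lucene dependency, so the object and method names are made up for illustration. It shows how a reader that is unaware of the prefix picks it up as a leading junk character.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStreamReader}
import java.nio.charset.StandardCharsets.UTF_8

object LengthPrefixDemo {
  // Emulates the documented layout: VInt byte length, then the UTF-8 bytes.
  def writeLengthPrefixed(s: String): Array[Byte] = {
    val bytes = s.getBytes(UTF_8)
    val out = new ByteArrayOutputStream()
    var n = bytes.length
    while ((n & ~0x7F) != 0) {        // VInt: 7 bits per byte, high bit marks continuation
      out.write((n & 0x7F) | 0x80)
      n >>>= 7
    }
    out.write(n)
    out.write(bytes, 0, bytes.length)
    out.toByteArray
  }

  // Reads the bytes the way a plain java.io reader would, unaware of the prefix.
  def readFirstLine(data: Array[Byte]): String = {
    val reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(data), UTF_8))
    try reader.readLine() finally reader.close()
  }
}
```

For "abcd" the emulated file is five bytes long, and the plain reader returns the length byte followed by the four letters.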

How to use the pattern matcher only?

Hi,
Thanks for the great work (saw this from the AI's SPIKE-CORD demo). I'm working on information extraction (without the retrieval part) and this is very useful (your pattern matching is a lot more powerful than spaCy's rule-based matching). So is there any documentation on how to use the pattern matcher itself? There's an Odin manual, but Odinson seems to have some new pattern grammar rules. Specifically, given a sentence and a pattern, I want to find the tokens that match some sub-pattern.

Thank you!

Store event arguments

This is on my todo list, but it is going to need discussion...

Improve testing coverage to 80%

Improve code coverage to 80%

Description of the current state of the tests:

3 projects [#lines without coverage]

Play App [196]
Core [790]
Extra [254]

Play App

3 files.
0% tested

Core

13 files.

Top lowest in testing %tested [#lines without testing]:

lucene: 69.1% [458]
compiler: 51.4% [136]
digraph: 56.4% [47]
state: 58.97% [48]

Extra

8 files.
0% tested.

Suggested order:

lucene package
compiler package
Extra project
PlayFramework tests
Re-evaluate the testing situation.

Create plug-in repo

@kevinlum commented on Mon Jun 29 2020

Dane: create a new repo for odinson plugins?

@MihaiSurdeanu commented on Wed Jul 15 2020

@danebell : is this addressed in an odinson issue? I think this belongs there.

Consecutive searches with results containing matches differing only in named captures do not render

Describe the bug
When a second query with named captures is run that returns the same documents and spans in the same order, the new results are not rendered in the UI if the capture groups differ in name.

To Reproduce
Steps to reproduce the behavior:

phosphorylation >nmod_of (?<theme> [])
phosphorylation >nmod_of (?<poop> [])
Disappointment

Expected behavior
New results should always be rendered.

Proposed solution
The fix is to assign unique keys that account for the content of the results. Hashing the scoreDoc json with something like weak-key may be one option.
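The idea of a content-derived key can be sketched as follows. This is written in Scala only to illustrate the principle; the UI itself would need a JavaScript equivalent. `contentKey` is a hypothetical helper, and the JSON strings in the usage note are made-up examples rather than the API's real output.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

object ResultKeys {
  // Derives a key from the full serialized result, so any change in content
  // (including the names of the captures) yields a different key.
  def contentKey(resultJson: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(resultJson.getBytes(UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString
}
```

Two results such as {"doc":1,"captures":["theme"]} and {"doc":1,"captures":["agent"]} then get different keys even though they cover the same document and span.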

Odinson server cuts connection after 75 seconds

Describe the bug
When executing a long query (against the backend API) the connection is cut by the server.
The connection always terminates at the 1 minute, 15 seconds mark. This makes it hard to work with odinson against very large corpora.

To Reproduce
Any query that doesn't return in 75 seconds.

Expected behavior
I would expect the connection to stay alive until results are returned.

Screenshots/Errors
Two outputs of curl, with the timing from date (IP parts replaced by X).

date && curl -v -X GET "http://X.X.X.X:9000/api/search?odinsonQuery=%5B%5D%20%3Edobj%20%5B%5D" -H "accept: application/json" || date
Sun Sep  8 15:23:56 IDT 2019
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying X.X.X.X...
* TCP_NODELAY set
* Connected to X.X.X.X (X.X.X.X) port 9000 (#0)
> GET /api/search?odinsonQuery=%5B%5D%20%3Edobj%20%5B%5D HTTP/1.1
> Host: X.X.X.X:9000
> User-Agent: curl/7.58.0
> accept: application/json
> 
* Recv failure: Connection reset by peer
* stopped the pause stream!
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
Sun Sep  8 15:25:11 IDT 2019

date && curl -v -X GET "http://X.X.X.X:9000/api/search?odinsonQuery=%5B%5D%20%3Edobj%20%5B%5D" -H "accept: application/json" || date
Sun Sep  8 15:30:00 IDT 2019
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying X.X.X.X...
* TCP_NODELAY set
* Connected to X.X.X.X (X.X.X.X) port 9000 (#0)
> GET /api/search?odinsonQuery=%5B%5D%20%3Edobj%20%5B%5D HTTP/1.1
> Host: X.X.X.X:9000
> User-Agent: curl/7.58.0
> accept: application/json
> 
* Recv failure: Connection reset by peer
* stopped the pause stream!
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
Sun Sep  8 15:31:16 IDT 2019

Move OdinsonIndexWriter.mkDirectedGraph to companion object

Documentation

Would there be any documentation available for getting odinson running, usage, API, documentation on the language, etc?

Testing the ! operator

Documentation type: [End-user documentation, wiki, code comments]

Improvements requested:

Link (if available):

Allow quantification over sequences of surface tokens and graph edges

Not sure if this is a bug or intended behavior, but it seems that the current Odinson pattern language doesn't allow parentheses over a combination of surface tokens and graph edges. E.g., in Odin it was easy to capture an optional chain of verbs through a pattern like (>> [tag=/^V.*/])* but such a syntax does not compile in Odinson.

Define mention equality

As we introduce multiple types of mentions, we must also define equality both to prevent overmatching and enable meaningful comparisons.

Improve config behavior to allow for better overrides

support for ExecutionContext

Currently the OdinsonIndexSearcher is single threaded by default,
but you can create it with a java.util.concurrent.ExecutorService so that it searches over segments in parallel. This should reduce the query time on large indexes (in theory).
It would be nice to be able to convert a scala ExecutionContext into a java ExecutorService
so that we can initialize the index searcher with either.

It would also be nice to expose this in the config, so that we can set the number of threads to use there.
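One possible shape for the conversion is sketched below, using only the Scala and Java standard libraries. `ExecutorConversions.asExecutorService` is a hypothetical helper name, and treating the lifecycle methods as no-ops is an assumption of this sketch, made because a plain ExecutionContext has no shutdown concept.

```scala
import java.util.Collections
import java.util.concurrent.{AbstractExecutorService, ExecutorService, TimeUnit}
import scala.concurrent.ExecutionContext

object ExecutorConversions {
  def asExecutorService(ec: ExecutionContext): ExecutorService = ec match {
    // Some execution contexts already are executor services; reuse them as-is.
    case es: ExecutorService => es
    // Otherwise delegate task execution and make the lifecycle methods no-ops.
    case other =>
      new AbstractExecutorService {
        override def execute(command: Runnable): Unit = other.execute(command)
        override def shutdown(): Unit = ()
        override def shutdownNow(): java.util.List[Runnable] = Collections.emptyList[Runnable]()
        override def isShutdown: Boolean = false
        override def isTerminated: Boolean = false
        override def awaitTermination(timeout: Long, unit: TimeUnit): Boolean = false
      }
  }
}
```

With this in place, the searcher could accept either type, and the thread count could be set wherever the underlying pool is created from the config.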

Move the rule learning code

@kevinlum commented on Mon Jun 29 2020

Then Robert: move the rule learning code there.

@MihaiSurdeanu commented on Wed Jul 15 2020

@danebell : this belongs in odinson

Add listener/control process

@kevinlum commented on Mon Jun 29 2020

Robert: add listener/control process to rule learning. Also, keep in mind that rule learning is more than binary relations. Minimally, we should learn unary relations as well.

Paginated Odinson search results

Search results using Odinson UI are currently limited to the first page of results.

The UI should have a link to the next page of results and show progress in the pagination (e.g. page 1 of 12).

To consider: allowing the user to specify results per page, perhaps from a set list (5, 10, 25, 50 per page, e.g.)
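The page arithmetic behind a "page 1 of 12" indicator is small. The following sketch is written in Scala for illustration only, and every name in it is hypothetical; the list of allowed page sizes mirrors the example values mentioned above.

```scala
object Pagination {
  // Example set of selectable page sizes, taken from the suggestion above.
  val allowedPageSizes: Seq[Int] = Seq(5, 10, 25, 50)

  def pageCount(totalHits: Int, pageSize: Int): Int = {
    require(pageSize > 0, "pageSize must be positive")
    (totalHits + pageSize - 1) / pageSize // ceiling division
  }

  def label(page: Int, totalHits: Int, pageSize: Int): String =
    s"page $page of ${pageCount(totalHits, pageSize)}"
}
```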

add document as heterogeneous container

By representing a document as a heterogeneous container we could add arbitrary annotations to it. This can be used for document-level metadata as well as sentence-level annotations.

See https://gerardnico.com/code/design_pattern/typesafe_heterogeneous_container

Also see https://github.com/milessabin/shapeless/wiki/Feature-overview:-shapeless-2.0.0#heterogenous-maps
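A minimal sketch of the typesafe heterogeneous container pattern in plain Scala (no shapeless) follows. `Key` and `HContainer` are hypothetical names for this illustration and are not part of Odinson.

```scala
// A key carries the type of the value stored under it.
// Keys are compared by reference, so two keys with the same name are distinct.
final class Key[A](val name: String)

final class HContainer private (private val entries: Map[Key[_], Any]) {
  // The only way to add an entry, so the value type always matches the key type.
  def put[A](key: Key[A], value: A): HContainer =
    new HContainer(entries.updated(key, value))

  // The cast is safe because put enforces the key/value type pairing.
  def get[A](key: Key[A]): Option[A] =
    entries.get(key).map(_.asInstanceOf[A])
}

object HContainer {
  val empty: HContainer = new HContainer(Map.empty)
}
```

A document could then hold, say, a `Key[String]` for a title and a `Key[Int]` for a publication year side by side, with each lookup returning the right type without a cast at the call site.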

Run benchmarks

Using a few rules (events) benchmark 100, 1k, 100k docs (run flight recorder on the code)

Run benchmarks based on refined criteria (from above)

Odinson totalHits slows down queries when there are many matches

Is your feature request related to a problem? Please describe.
This issue is related to #35, which makes odinson unusable on large corpora with patterns that occur frequently. This feature will mitigate the problem partially and I think it's a good feature to have regardless of the issue.

Describe the solution you'd like
I would like to control the behavior of counting all match hits via a configuration flag to speed the queries up (by not computing the totalHits).

Describe alternatives you've considered
Another approach is to make this a parameter of the request itself, but that would not be consistent with how pageSize is defined (it is also in the global configuration).
I think the distinction between what configuration is global and what should be controllable per query is a separate issue.

Additional context
I've submitted a PR for this: #34

Remove handling of single quoted strings

We currently support single and double quotes around strings, but we are using java's approach to escape characters in the string, which is designed for double quote only. Rather than jump through flaming hoops, we should just no longer support single quotes.

TODO:

remove handling in the QueryParser
edit the documentation String page

Odinson API: per query specification of odinson.pageSize

When working against the Odinson API, it is sometimes desirable to obtain a different number of results for different queries. E.g., when testing out a pattern against an unlabeled dataset it's often enough to see a handful of query matches; however, when working against labeled datasets like TACRED it's sometimes desirable to retrieve all results in order to calculate statistics of correct/incorrect extractions.

We currently increase the odinson.pageSize globally, but it means that queries against unlabeled datasets sometimes take longer than needed.
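One way to picture the requested behavior is an optional per-query value that falls back to the global setting. This is a sketch under stated assumptions: `effectivePageSize` is a hypothetical helper, and ignoring non-positive values is a choice made for this example only.

```scala
object PageSizeResolution {
  // requested: the value supplied with the query, if any
  // default:   the global odinson.pageSize from the configuration
  def effectivePageSize(requested: Option[Int], default: Int): Int =
    requested.filter(_ > 0).getOrElse(default)
}
```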

/api/sentence should support json.gz

The /api/sentence endpoint needs to be able to read compressed json files.

Organization of Unit tests

Existing unit tests are mixed together, and common resources are not shared between them.

We need to split out the tests and consolidate resources for common use.

Support queries over a single document in odinson's API

When developing an extractor for a relation, analysts often write patterns to match specific example sentences they've gathered in advance. To allow quick feedback, it would be nice if Odinson had the ability to run a pattern against a specific sentence or document, in their raw text form.

A possible solution would be to add an API call which takes as input a query and a text document, indexes the doc in-mem and runs the query against the in-mem index.

debug mode for rule learning

As a rule-learning user I want to be able to enable/disable the path-keeping in the match, so that I can use it for rule-learning when I need it, but can turn it off for efficiency when I don't.

I think the path-remembering code is in the path branch, so once this is in place we need to merge that into master. This will allow rule-learning from the master branch.

Support event-like mentions in state

Currently, the state is unaware of arguments/attributes, so we essentially only support textbound ("flat") mentions. Ultimately, though, we'll want to be able to handle mentions with named (and typed) arguments (cf. Odin).

Requirements

many-to-many database structure
Mention data structure (and flavors thereof)
???

ExtractorEngine.numParentDocs()

Just as we currently have a method to return the total number of Lucene documents (i.e., the number of sentences in the corpus), it would be convenient to provide a method to quickly count the total number of parent documents in the corpus.
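Conceptually, the count is the number of distinct parent ids among the sentence-level documents. The sketch below only models that relationship in memory; `SentenceDoc` is a hypothetical stand-in, and a real implementation would presumably ask the index rather than scan every sentence.

```scala
// Hypothetical stand-in for a sentence-level Lucene document;
// the field names follow documentIdField and sentenceIdField in the config.
final case class SentenceDoc(docId: String, sentId: Int)

object ParentDocCount {
  // Counts distinct parent document ids across all sentences.
  def numParentDocs(sentences: Iterable[SentenceDoc]): Int =
    sentences.iterator.map(_.docId).toSet.size
}
```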

Provide examples for using the state

We need to create a wiki page or something equivalent that demonstrates how the state can be used both programmatically and through the UI.
