
lum-ai / odinson

Odinson is a powerful and highly optimized open-source framework for rule-based information extraction. Odinson couples a simple, yet powerful pattern language that can operate over multiple representations of text, with a runtime system that operates in near real time.

Home Page: https://lum.ai/odinson/docs/

License: Apache License 2.0

Scala 100.00%
extraction-engine information-extraction nlp odinson open-source rule-based surface syntax text-mining

odinson's People

Contributors

beckysharp, dependabot[bot], gcgbarbosa, kevinlum, kwalcock, marcovzla, maxaalexeeva, mcshlain, mihaisurdeanu, myedibleenso, nezda, reynoldsm88, robertvacareanu, schmmd, victoryhb

odinson's Issues

Sorting results by document id

A minor issue we've been dealing with is that it's useful for us to get results sorted by doc id. For example, it lets us more quickly display "missing" extractions when working against a labeled dataset, it gives users a better feel for the frequency of matches per document, and it provides a stable order of results when new documents are added to the index.

We currently fetch all of Odinson's results and re-sort them, but it would be nice if Odinson supported an option to return results sorted by doc id (and potentially supported paging by doc id, etc.).
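The client-side re-sort described above might look like the following minimal sketch. `Hit` and its fields are hypothetical stand-ins for whatever the application receives from Odinson; they are not Odinson classes.

```scala
// Hypothetical stand-in for one search result; NOT an Odinson type.
final case class Hit(docId: String, sentenceIndex: Int, score: Float)

object SortByDocId {
  // Order by external document id, then by sentence position within the document.
  def sortHits(hits: Seq[Hit]): Seq[Hit] =
    hits.sortBy(h => (h.docId, h.sentenceIndex))

  // Number of matches per document, useful for a quick frequency overview.
  def hitsPerDoc(hits: Seq[Hit]): Map[String, Int] =
    hits.groupBy(_.docId).map { case (id, hs) => id -> hs.size }
}
```

Because this requires fetching every result first, it only illustrates the current workaround, not the requested server-side option.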

Add info to docs

End-user documentation

We need to add (not necessarily in this order, though I did try to put them in a semblance of order):

  • rest-api info
  • docker info
  • event query info
  • state (once it's ready)
  • the yaml format / working with rule files
  • as part of yaml format -> variables
  • walk-through example (ref external webapp)
  • walk-through example in code
  • structure of a Mention
  • accessing things like named captures, etc.
  • structure of an OdinsonDocument

Nice to have:

We also need to:

  • re-write the main landing README page to be simpler and point people to docs

others?

Proofreading/editing of current content SUPER appreciated

Greedy quantifiers

Currently Odinson quantifiers are neither greedy nor lazy. The greedy v. lazy distinction is, however, essential moving forward. Consider a naive rule fragment for detecting an NP:

[tag=/N.*/]+

The current behavior is to return all (or all tiled?) possible matches, rather than the longest. This clutters the state and is counter-intuitive to the user considering the syntax and standard meaning of the + quantifier.
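To make the distinction concrete, here is a small sketch in plain Scala (no Odinson code) that contrasts "every possible match" with a greedy, longest-match reading of `[tag=/N.*/]+` over a sequence of POS tags. The function names are made up for illustration, and "every possible match" is only one reading of the current behavior described above.

```scala
object QuantifierDemo {
  // Every span [start, end) in which all tags satisfy the predicate.
  def allMatches(tags: Seq[String], p: String => Boolean): Seq[(Int, Int)] =
    for {
      start <- tags.indices
      end <- (start + 1) to tags.length
      if tags.slice(start, end).forall(p)
    } yield (start, end)

  // Greedy, non-overlapping: from each starting token, consume the longest run.
  def greedyMatches(tags: Seq[String], p: String => Boolean): Seq[(Int, Int)] = {
    val out = Seq.newBuilder[(Int, Int)]
    var i = 0
    while (i < tags.length) {
      if (p(tags(i))) {
        var j = i
        while (j < tags.length && p(tags(j))) j += 1
        out += ((i, j))
        i = j
      } else {
        i += 1
      }
    }
    out.result()
  }
}
```

For the tags `DT NN NNS VBZ NN`, the first function yields four spans while the greedy one yields two, which is the clutter the issue refers to.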

Code coverage integration

In order to know how well Odinson is covered by tests, both the language and the system, we need to integrate some form of code coverage report.

Options include:

  • the scoverage sbt plugin
  • codecov
  • others

We should consider pricing/long-term support when making decisions

Support queries over different datasets in Odinson's API

As developers of an application on top of Odinson it would be helpful to us if the dataset/index we query against could be determined dynamically, at query time.

One way to support this in Odinson is to add an optional "dataset name" parameter in the query API, and to require the user to specify the index path for each dataset in the config file before starting the API.

We think we might also get a similar effect by indexing multiple datasets into the same index and use the "parent query" feature in the API to restrict the results to a specific dataset. We haven't tried this direction though, and we're not sure if that's the intended use.
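The first option (an optional "dataset name" parameter resolved against paths declared in the config) could be sketched roughly as follows. The object name, the map of datasets, and the error message are all hypothetical; nothing here reflects an existing Odinson API.

```scala
object DatasetResolver {
  // `datasets` maps a dataset name to its index path, as it might be read
  // from the config file at startup (hypothetical layout).
  def resolveIndexPath(
      datasets: Map[String, String],
      defaultIndexPath: String,
      datasetName: Option[String]
  ): Either[String, String] = datasetName match {
    case None       => Right(defaultIndexPath) // parameter omitted: keep current behavior
    case Some(name) => datasets.get(name).toRight(s"unknown dataset: $name")
  }
}
```

Returning an `Either` keeps an unknown dataset name a reportable client error rather than an exception inside the query handler.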

Search results order based on document id (not lucene id)

Is your feature request related to a problem? Please describe.
This is a continuation of #20, which changed the behavior of search to return the results in the order of the internal lucene id. We would like the order of the search results to correspond to the defined external document ids and not the internal lucene id.

Describe the solution you'd like
I propose a simple workaround: load the documents into the index in the order of the document ids, so that the order induced by the lucene id matches the ordering of the document ids.
This should be controllable by configuration, as it will slow the indexing process and will not be needed in all scenarios.

Additional context
A PR is available here: #37

Add scala docs

We need scala docs to clearly document the functionality of each Odinson class.
These should be linked to the current docs.

Testing examples in the Docs

Testing

Now that the docs are live: http://gh.lum.ai/odinson/
We should make a unit test from each of the examples provided.
There are currently several examples, likely some more will be added, but there are enough to get started.

Thanks!

Memory usage for annotation/indexer stage

It looks like the annotator in the indexer step can require a fair amount of memory, and occasionally crashes with out-of-memory errors, but the memory requirements aren't currently mentioned in the documentation.

export SBT_OPTS="-Xmx32g" seems to show about 22g of usage for 32 threads in a corpus of 40k documents between 200 bytes and ~2MB each. The error when it was crashing was very long (and across threads), so it was easy to miss the out-of-memory error buried in there. It seemed to be throwing an error on a null, so I initially thought it was an issue with invalid characters or some other parse error on the dataset.

Seems to run much faster with more memory too.
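One quick way to confirm that a heap setting such as the `SBT_OPTS` value above actually reached the JVM is to print the maximum heap from inside the process. This is a generic JVM check, not part of Odinson; the object name is made up.

```scala
object HeapCheck {
  // Maximum heap the JVM will attempt to use, in GiB.
  def maxHeapGiB(): Double =
    Runtime.getRuntime.maxMemory.toDouble / (1024L * 1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(f"max heap: ${maxHeapGiB()}%.1f GiB")
}
```

If the reported value is far below what was requested, the option is probably not being passed to the JVM that runs the indexer.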

odinson stuck on a query with an optional disjunctive component

Describe the bug
From the odinson shell:

  • If I query (?<subject> [entity="ORGANIZATION"]+) I get immediate results.
  • If I query (?<subject> [entity="ORGANIZATION"]+) (a | the) I get immediate results
  • If I query (?<subject> [entity="ORGANIZATION"]+) (a | the)? odinson is stuck indefinitely while consuming 100% CPU.

The problem doesn't occur on very small indices, but seems to reproduce on indices of a few thousand documents (we can share a sample corpus and index if needed).

I hit this issue on the current master as well as on older versions.

Benchmark Odinson rules performance

Using a few rules (events), benchmark 100, 1k, and 100k docs (run flight recorder on the code).

Meet with Marco/Keith for details this week and build indexes/indices.

Simplify search result display

Currently, full TAG displays are rendered for each result in a page. This is necessarily time-consuming to render.

The initial view of the search results should be in text, with an interface to expand the result into the TAG rendering. The matches should ideally still be indicated within the sentence, perhaps by highlighting.

Testing the ! operator

Testing

Test the ! operator for token constraints.

E.g., [tag=/N.*/ & !lemma=puppy]

also test -- [tag=/N.*/ & lemma!=puppy]
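As a starting point for such a test, the sketch below models the expected semantics in plain Scala: a token is a map from field name to value, and the two spellings of negation are assumed to be equivalent. This is only a model of the intended behavior, not Odinson's compiler or its test harness, and every name in it is hypothetical.

```scala
object TokenConstraintDemo {
  type Token = Map[String, String] // field name -> value
  type Constraint = Token => Boolean

  def fieldEquals(field: String, value: String): Constraint =
    t => t.get(field).contains(value)

  def fieldMatches(field: String, regex: String): Constraint =
    t => t.get(field).exists(_.matches(regex))

  def not(c: Constraint): Constraint = t => !c(t)

  def and(a: Constraint, b: Constraint): Constraint = t => a(t) && b(t)

  // ASSUMPTION: `lemma!=puppy` is sugar for `!lemma=puppy`.
  def fieldNotEquals(field: String, value: String): Constraint =
    not(fieldEquals(field, value))

  // [tag=/N.*/ & !lemma=puppy]
  val prefixNegation: Constraint = and(fieldMatches("tag", "N.*"), not(fieldEquals("lemma", "puppy")))
  // [tag=/N.*/ & lemma!=puppy]
  val infixNegation: Constraint = and(fieldMatches("tag", "N.*"), fieldNotEquals("lemma", "puppy"))
}
```

A real test would run both queries against an indexed sentence and check that they return the same matches.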

Annotation agnosticism

As a user with a custom index, I would like to include arbitrary token-level tags.

I think we need to adjust the code that interacts with this portion of the config:

compiler {
  # fields available per token
  allTokenFields = [
    ${odinson.index.rawTokenField},
    ${odinson.index.wordTokenField},
    ${odinson.index.normalizedTokenField},
    ${odinson.index.lemmaTokenField},
    ${odinson.index.posTagTokenField},
    ${odinson.index.chunkTokenField},
    ${odinson.index.entityTokenField},
    ${odinson.index.incomingTokenField},
    ${odinson.index.outgoingTokenField},
  ]
  # the token field to be used when none is specified
  defaultTokenField = ${odinson.index.normalizedTokenField}
  sentenceLengthField = ${odinson.index.sentenceLengthField}
  dependenciesField = ${odinson.index.dependenciesField}
  incomingTokenField = ${odinson.index.incomingTokenField}
  outgoingTokenField = ${odinson.index.outgoingTokenField}
  # if we are using the normalizedTokenField as the default
  # then we should casefold the queries to the default field
  # so that they match
  aggressiveNormalizationToDefaultField = true
}
index {
  # the raw token
  rawTokenField = raw
  # the word itself
  wordTokenField = word
  # a normalized version of the token
  normalizedTokenField = norm
  # the normalized field will include values from the following fields
  addToNormalizedField = [
    ${odinson.index.rawTokenField},
    ${odinson.index.wordTokenField},
  ]
  lemmaTokenField = lemma
  posTagTokenField = tag
  chunkTokenField = chunk
  entityTokenField = entity
  incomingTokenField = incoming
  outgoingTokenField = outgoing
  dependenciesField = dependencies
  documentIdField = docId
  sentenceIdField = sentId
  sentenceLengthField = numWords
  maxNumberOfTokensPerSentence = 100

Specifically the QueryCompiler... (e.g.,

def apply(config: Config, vocabulary: Vocabulary): QueryCompiler = {
++)

Also -- while we're making changes, let's rename "dependenciesField" to "graphField" since we're trying to be all annotation agnostic and stuff. :)

Gracefully recover from bad queries

Illegal or non-functioning queries cause the UI to crash permanently. An error stack trace is displayed on the webpage, and submitting a working query will not cause the stack trace to go away, even when good results are returned by the REST API (as reflected in the npm run dev console output).

I would like graceful recovery from bad queries. Most importantly, entering a good query after a bad one should lead to results being displayed as usual. Ideally, the UI would point out the problem with the query. However, just a message briefly saying that the query contains an error would be an improvement.

Test single quote

Testing

While we allow either single quotes or double quotes to wrap strings with special stuff going on:
e.g., "3:10" to Yuma vs. '3:10' to Yuma

Behind the scenes, we're using Java escaping rules, which don't allow single quotes.
We need to see if the single quote is working, specifically with escaped characters inside a single quote.
e.g., 'abc"def' or something with a unicode escaped char.

@marcovzla please add examples/clarify the details of this issue. thanks!

dependency vocabulary first entry is corrupted and not searchable

Describe the bug
The dependencies.txt file written to the index contains some additional bytes at the beginning of the file, causing the first dependency label to not be searchable. This is hard to detect because it only affects a single dependency label, and the specific label will change between runs of indexing.

Reproducing this behavior

  1. Go to the index directory, open the dependencies.txt file, and look at the dependency on the first line (it will be prefixed with some junk char/s). For example, let's say it was compound; then run the query (?<p> []) <compound [] and you will get 0 results.

  2. Manually delete the junk prefix char/s on the first line of dependencies.txt and run the query again (after restarting the backend). This time you should see results.

Investigating the reason for the bug
I believe this bug was introduced by: #30
From what I understand this happens because the file is written with the FSDirectory API but read with the LumAI File Utils (that uses regular streams from java.io)

Here is a code snippet to demonstrate this behavior:

  import java.io.{BufferedReader, FileInputStream, InputStreamReader}
  import java.nio.file.Paths
  import ai.lum.common.ConfigUtils._ // provides the config[String](...) syntax
  import com.typesafe.config.ConfigFactory
  import org.apache.lucene.store.{FSDirectory, IOContext}

  val config = ConfigFactory.load()
  // writing the data through FSDirectory API
  val directory = FSDirectory.open(Paths.get(config[String]("odinson.indexDir")))
  val streamOut = directory.createOutput("test.txt", new IOContext())
  streamOut.writeString("abcd")
  streamOut.close()

  // reading the data through regular java.io API
  val stream1In = new BufferedReader(
    new InputStreamReader(
      new FileInputStream(config[String]("odinson.indexDir") + "/test.txt")
    )
  )
  println("|" + stream1In.readLine() + "|")
  stream1In.close()

  // reading the data through FSDirectory API
  val stream2In = directory.openInput("test.txt", new IOContext())
  println("|" + stream2In.readString() + "|")
  stream2In.close()

The output is:

|�abcd|
|abcd|

On the first output line (reading using java.io), there is an extra character that causes the corruption.

The solution should be to use the same API for both reading and writing, either the FSDirectory one or the java.io one, and not to mix them.
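A likely explanation for the extra character is that Lucene's `writeString` stores a variable-length byte count before the UTF-8 bytes, and a plain text reader treats that count as part of the line. The sketch below simulates this with the standard library only (no Lucene dependency); the helper names are made up, and the encoding is re-implemented here as an assumption about the on-disk format rather than taken from Lucene's code.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStreamReader}
import java.nio.charset.StandardCharsets.UTF_8

object LengthPrefixDemo {
  // Simulated length-prefixed string: variable-length byte count, then UTF-8 bytes.
  def writePrefixed(s: String): Array[Byte] = {
    val payload = s.getBytes(UTF_8)
    val out = new ByteArrayOutputStream()
    var n = payload.length
    while ((n & ~0x7F) != 0) { // emit 7 bits at a time, high bit = "more bytes follow"
      out.write((n & 0x7F) | 0x80)
      n >>>= 7
    }
    out.write(n)
    out.write(payload, 0, payload.length)
    out.toByteArray
  }

  // Reads the bytes as plain text: the length prefix leaks into the first line.
  def readAsPlainText(bytes: Array[Byte]): String = {
    val reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes), UTF_8))
    try reader.readLine() finally reader.close()
  }

  // Reads the length prefix first, then exactly that many bytes.
  def readPrefixed(bytes: Array[Byte]): String = {
    var pos = 0
    var length = 0
    var shift = 0
    var more = true
    while (more) {
      val b = bytes(pos) & 0xFF
      pos += 1
      length |= (b & 0x7F) << shift
      shift += 7
      more = (b & 0x80) != 0
    }
    new String(bytes, pos, length, UTF_8)
  }
}
```

For "abcd" the prefix is a single byte with value 4, a control character that many terminals render as a replacement glyph.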

How to use the pattern matcher only?

Hi,
Thanks for the great work (saw this from the AI's SPIKE-CORD demo). I'm working on information extraction (without the retrieval part) and this is very useful (your pattern matching is a lot more powerful than Spacy's rule-based matching). So is there any documentation on how to use the pattern matcher itself? There's an Odin manual, but Odinson seems to have some new pattern grammar rules. Specifically, given a sentence and a pattern, I want to find the tokens that match some sub-pattern.

Thank you!

Improve testing coverage to 80%

Improve code coverage to 80%

Description of the current state of the tests:

3 projects [#lines without coverage]

  1. Play App [196]
  2. Core [790]
  3. Extra [254]

Play App

3 files.
0% tested

Core

13 files.

Top lowest in testing %tested [#lines without testing]:

  1. lucene: 69.1% [458]
  2. compiler: 51.4% [136]
  3. digraph: 56.4% [47]
  4. state: 58.97% [48]

Extra

8 files.
0% tested.

Suggested order:

  1. lucene package
  2. compiler package
  3. Extra project
  4. PlayFramework tests
  5. Re-evaluate the testing situation.

Consecutive searches with results containing matches differing only in named captures do not render

Describe the bug
When a second query with named captures is run that returns the same documents and spans in the same order, the new results are not rendered in the UI if the capture groups differ in name.

To Reproduce
Steps to reproduce the behavior:

  1. phosphorylation >nmod_of (?<theme> [])
  2. phosphorylation >nmod_of (?<poop> [])
  3. Disappointment

Expected behavior
New results should always be rendered.

Proposed solution
The fix is to assign unique keys that account for the content of the results. Hashing the scoreDoc json with something like weak-key may be one option.

Odinson server cuts connection after 75 seconds

Describe the bug
When executing a long query (against the backend API) the connection is cut by the server.
The connection always terminates at the 1 minute, 15 seconds mark. This makes it hard to work with odinson against very large corpora.

To Reproduce
Any query that doesn't return in 75 seconds.

Expected behavior
I would expect the connection to stay alive until results are returned.

Screenshots/Errors
Two outputs of curl, with the timing shown by date (IP parts replaced by X).

date && curl -v -X GET "http://X.X.X.X:9000/api/search?odinsonQuery=%5B%5D%20%3Edobj%20%5B%5D" -H "accept: application/json" || date
Sun Sep  8 15:23:56 IDT 2019
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying X.X.X.X...
* TCP_NODELAY set
* Connected to X.X.X.X (X.X.X.X) port 9000 (#0)
> GET /api/search?odinsonQuery=%5B%5D%20%3Edobj%20%5B%5D HTTP/1.1
> Host: X.X.X.X:9000
> User-Agent: curl/7.58.0
> accept: application/json
> 
* Recv failure: Connection reset by peer
* stopped the pause stream!
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
Sun Sep  8 15:25:11 IDT 2019
date && curl -v -X GET "http://X.X.X.X:9000/api/search?odinsonQuery=%5B%5D%20%3Edobj%20%5B%5D" -H "accept: application/json" || date
Sun Sep  8 15:30:00 IDT 2019
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying X.X.X.X...
* TCP_NODELAY set
* Connected to X.X.X.X (X.X.X.X) port 9000 (#0)
> GET /api/search?odinsonQuery=%5B%5D%20%3Edobj%20%5B%5D HTTP/1.1
> Host: X.X.X.X:9000
> User-Agent: curl/7.58.0
> accept: application/json
> 
* Recv failure: Connection reset by peer
* stopped the pause stream!
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
Sun Sep  8 15:31:16 IDT 2019

Documentation

Would there be any documentation available for getting odinson running, usage, API, documentation on the language, etc?

Testing the ! operator

Documentation type: [End-user documentation, wiki, code comments]

Improvements requested:

Link (if available):

Allow quantification over sequences of surface token and graph edges

Not sure if this is a bug or intended behavior, but it seems that the current Odinson pattern language doesn't allow parentheses around a combination of surface tokens and graph edges. E.g., in Odin it was easy to capture an optional chain of verbs through a pattern like (>> [tag=/^V.*/])* but such syntax does not compile in Odinson.

Define mention equality

As we introduce multiple types of mentions, we must also define equality both to prevent overmatching and enable meaningful comparisons.

support for ExecutionContext

Currently the OdinsonIndexSearcher is single-threaded by default,
but you can create it with a java.util.concurrent.ExecutorService so that it searches over segments in parallel. This should reduce the query time on large indexes (in theory).
It would be nice to be able to convert a Scala ExecutionContext into a Java ExecutorService
so that we can initialize the index searcher with either.

It would also be nice to expose this in the config, so that we can set the number of threads to use there.
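One common way to do the conversion is to wrap the ExecutionContext in a `java.util.concurrent.AbstractExecutorService` that forwards `execute` to it. The sketch below uses only the Scala and Java standard libraries; `ExecutionContextAdapter` is a made-up name, and how the result would be passed to the index searcher is left open.

```scala
import java.util.Collections
import java.util.concurrent.{AbstractExecutorService, ExecutorService, TimeUnit}
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService}

object ExecutionContextAdapter {
  def toExecutorService(ec: ExecutionContext): ExecutorService = ec match {
    // Already backed by an ExecutorService: reuse it directly.
    case service: ExecutionContextExecutorService => service
    // Otherwise forward task execution to the ExecutionContext.
    case other =>
      new AbstractExecutorService {
        @volatile private var stopped = false

        override def execute(command: Runnable): Unit = other.execute(command)

        // The wrapper does not own the underlying context, so shutting it
        // down only flips a flag; it does not stop the context's threads.
        override def shutdown(): Unit = { stopped = true }

        override def shutdownNow(): java.util.List[Runnable] = {
          stopped = true
          Collections.emptyList[Runnable]()
        }

        override def isShutdown: Boolean = stopped
        override def isTerminated: Boolean = stopped
        override def awaitTermination(timeout: Long, unit: TimeUnit): Boolean = stopped
      }
  }
}
```

A thread-count setting in the config could then simply choose the size of a fixed thread pool that backs the context.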

Paginated Odinson search results

Search results using Odinson UI are currently limited to the first page of results.

The UI should have a link to the next page of results and show progress in the pagination (e.g. page 1 of 12).

To consider: allowing the user to specify results per page, perhaps from a set list (5, 10, 25, 50 per page, e.g.)
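The "page 1 of 12" indicator needs only the total hit count and the page size. A minimal sketch of that arithmetic, with made-up names and independent of the actual UI code:

```scala
object Pagination {
  // Number of pages needed to show `totalHits` results, `perPage` at a time.
  def pageCount(totalHits: Int, perPage: Int): Int = {
    require(perPage > 0, "perPage must be positive")
    (totalHits + perPage - 1) / perPage // ceiling division
  }

  // Progress text such as "page 1 of 12".
  def progressLabel(page: Int, totalHits: Int, perPage: Int): String =
    s"page $page of ${pageCount(totalHits, perPage)}"
}
```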

Run benchmarks

Using a few rules (events), benchmark 100, 1k, and 100k docs (run flight recorder on the code).

Run benchmarks based on refined criteria (from above)

Odinson totalHits slows down queries when there are many matches

Is your feature request related to a problem? Please describe.
This issue is related to #35, which makes odinson unusable on large corpora with patterns that occur frequently. This feature will mitigate the problem partially and I think it's a good feature to have regardless of the issue.

Describe the solution you'd like
I would like to control the behavior of counting all match hits via a configuration flag to speed the queries up (by not computing the totalHits).

Describe alternatives you've considered
Another approach is to make this a parameter of the request itself, but that would not be consistent with how pageSize is defined (also in the global configuration).
I think the distinction between what configuration is global and what should be controllable per query is a separate issue.

Additional context
I've submitted a PR for this: #34

Remove handling of single quoted strings

We currently support single and double quotes around strings, but we are using Java's approach to escaping characters in the string, which is designed for double quotes only. Rather than jump through flaming hoops, we should just no longer support single quotes.

TODO:

  • remove handling in the QueryParser
  • edit the documentation String page

Odinson API: per query specification of odinson.pageSize

When working against the Odinson API, it is sometimes desirable to obtain a different number of results for different queries. For example, when testing out a pattern against an unlabeled dataset it's often enough to see a handful of query matches; however, when working against labeled datasets like TACRED it's sometimes desirable to retrieve all results in order to calculate statistics of correct/incorrect extractions.

We currently increase the odinson.pageSize globally, but that means queries against unlabeled datasets sometimes take longer than needed.
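A per-query override that falls back to the global odinson.pageSize could be as small as the sketch below. The object and parameter names are hypothetical; how the optional value would arrive (for example as a query-string parameter) is not specified here.

```scala
object PageSizeResolver {
  // Use the per-query value when it is present and positive,
  // otherwise fall back to the globally configured page size.
  def effectivePageSize(requested: Option[Int], globalDefault: Int): Int =
    requested.filter(_ > 0).getOrElse(globalDefault)
}
```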

Organization of Unit tests

Existing unit tests are mixed and shared resources are not shared.

We need to split out tests, consolidate resources for common use

Support queries over a single document in odinson's API

When developing an extractor for a relation, analysts often write patterns to match specific example sentences they've gathered in advance. To allow quick feedback, it would be nice if Odinson had the ability to run a pattern against a specific sentence or document, in their raw text form.

A possible solution would be to add an API call which takes as input a query and a text document, indexes the doc in-mem and runs the query against the in-mem index.

debug mode for rule learning

As a rule-learning user I want to be able to enable/disable the path-keeping in the match, so that I can use it for rule-learning when I need it, but can turn it off for efficiency when I don't.

I think the path-remembering code is in the path branch, so once this is in place we need to merge that into master. This will allow rule-learning from the master branch.

Support event-like mentions in state

Currently, the state is unaware of arguments/attributes, so we essentially only support textbound ("flat") mentions. Ultimately, though, we'll want to be able to handle mentions with named (and typed) arguments (cf. Odin).

Requirements

  • many-to-many database structure
  • Mention data structure (and flavors thereof)
  • ???

ExtractorEngine.numParentDocs()

Just as we currently have a method to return the total number of Lucene documents (i.e., the number of sentences in the corpus), it would be convenient to provide a method to quickly count the total number of parent documents in the corpus.
