vnadgir / dkpro-core-asl
Automatically exported from code.google.com/p/dkpro-core-asl
replaceTest3() in AlignedStringTest reproduces the error.
Original issue reported on code.google.com by [email protected]
on 7 Mar 2012 at 3:55
getUrlAsFile() should take care that the temporary files have the same
extension as specified in the URL. E.g. if the URL ends in ".exe", the
temporary file should also end in ".exe", but currently it ends in "exe" only
(no dot).
Original issue reported on code.google.com by richard.eckart
on 16 Jun 2011 at 9:29
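A minimal sketch of how the suffix could be derived from the URL so that the extension, including the dot, is preserved (class and method names are hypothetical, not the actual getUrlAsFile() implementation):

```java
import java.io.File;
import java.io.IOException;

public class TempFileSuffix {
    // Derive a suffix (including the dot) from the last path segment of a URL.
    public static String suffixOf(String url) {
        String name = url.substring(url.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        return dot >= 0 ? name.substring(dot) : ".tmp";
    }

    public static File tempFileFor(String url) throws IOException {
        // File.createTempFile keeps the suffix verbatim, so ".exe" stays ".exe".
        return File.createTempFile("dkpro-", suffixOf(url));
    }
}
```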
The Wikipedia readers have become quite complex and do not cover new functionalities of JWPL, e.g. revisions. A new set of readers should be provided.
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:17
A reader for the BNC XML corpus format would be nice to have.
Original issue reported on code.google.com by richard.eckart
on 23 Dec 2011 at 11:03
As in summary.
Original issue reported on code.google.com by [email protected]
on 21 Jan 2011 at 4:22
In several tasks we need access to n-gram frequencies, e.g. from the Google
n-gram corpus. These should be provided as an external resource.
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:19
The corpus hierarchy is broken at the moment.
Several corpora do not implement the Corpus interface.
Original issue reported on code.google.com by [email protected]
on 31 Oct 2011 at 5:21
The documentUri is set to the ID of the document and the documentId is set to a
running number. Since multiple documents are in one file, the URI should be set
to something like file://path#docId and the documentId should be set to the
docId.
Original issue reported on code.google.com by richard.eckart
on 14 Jan 2012 at 5:57
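The suggested URI layout could be built like this (a sketch only; the helper name is hypothetical):

```java
import java.net.URI;

public class DocUriSketch {
    // Build a per-document URI of the form file:/path#docId, as suggested
    // in the issue; URI.create avoids the checked URISyntaxException.
    public static URI docUri(String fileUri, String docId) {
        return URI.create(fileUri + "#" + docId);
    }
}
```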
As in summary.
Original issue reported on code.google.com by [email protected]
on 21 Jan 2011 at 4:18
Being able to read and write the IMS Corpus Workbench tab-separated format
would be useful. We could use it to export corpora for search with CQP. Also,
we could read the WaCky corpora.
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:26
The analysis engine cannot deal with cases where TreeTagger does not output a
POS and lemma.
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:12
Background: most spelling correctors return more than one suggestion.
The alternative would be to create one annotation per suggestion and merge
later via offset comparison (which sounds like too much of a hassle).
Original issue reported on code.google.com by [email protected]
on 8 Aug 2011 at 6:51
How to reproduce the issue:
1. Let an AnalysisEngine process a document collection read with de.tudarmstadt.ukp.dkpro.core.io.text.TextReader containing a whitespace in its path (example: /var/lib/jenkins/jobs/DKPro Semantics/workspace/trunk/de.tudarmstadt.ukp.dkpro.semantics.bookindexing/src/test/resources/PhraseMatchEvaluator/)
2. Let the AnalysisEngine extract the String representation of the URI from the DocumentMetaData and try to instantiate a new URI instance: URI uri = new URI(DocumentMetaData.get(jcas).getDocumentUri());
3. An exception will be thrown:
java.net.URISyntaxException: Illegal character in path at index 32: file:/var/lib/jenkins/jobs/DKPro Semantics/workspace/trunk/de.tudarmstadt.ukp.dkpro.semantics.bookindexing/src/test/resources/PhraseMatchEvaluator/tokens%201.txt
The URI seems to be stored in an invalid form in the DocumentMetaData, as the whitespace in the path has not been encoded as "%20". The basename is encoded correctly, though.
Original issue reported on code.google.com by [email protected]
on 5 Sep 2011 at 3:21
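The failure mode can be reproduced in plain Java: new URI(String) rejects raw spaces, while File.toURI() percent-encodes the whole path. A sketch (class and method names hypothetical) of the safe construction:

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

public class UriEncodingSketch {
    // File.toURI() percent-encodes spaces, yielding a parsable URI.
    public static URI safeUri(String path) {
        return new File(path).toURI();
    }

    // new URI(String) throws URISyntaxException on a raw space.
    public static boolean isParsable(String raw) {
        try {
            new URI(raw);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```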
Add a TokenFilter component to remove tokens from the CAS.
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:08
We need to document the "inaccuracies" of CasToInlineXml and possibly include some sanity checks that log warnings if a CAS contains overlapping annotations or complex feature structures used as features - just to be sure that novice users are aware that strange things may be happening:
- Features whose values are FeatureStructures are not represented.
- Feature values which are strings longer than 64 characters are truncated.
- Feature values which are arrays of primitives are represented by strings that look like [ xxx, xxx ].
- The subject of analysis is presumed to be a text string.
- Some characters in the document's subject of analysis are replaced by blanks, because the characters aren't valid in XML documents.
- It doesn't work for overlapping annotations, because these cannot be represented as properly nested XML.
Original issue reported on code.google.com by richard.eckart
on 29 Mar 2011 at 6:50
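The 64-character truncation could be illustrated by a sketch like this (not the actual CasToInlineXml code):

```java
public class CasXmlLimits {
    // String feature values longer than 64 characters are cut off at 64.
    public static String truncate(String value) {
        return value.length() <= 64 ? value : value.substring(0, 64);
    }
}
```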
The NEGRA export format is one of the formats used by the Tiger Corpus and by
TüBa D/Z. It would be nice to be able to read them.
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:24
The UIMA CAS Editor expects a file called TypeSystem.xml in the project root.
It would be convenient if the XmiWriter could be configured to write the type
system in that location.
Original issue reported on code.google.com by richard.eckart
on 17 Apr 2011 at 5:27
ResourceCollectionReaderBase has a mandatory configuration parameter "PARAM_PATTERNS" which IMHO should not be mandatory. The default should be to load all documents.
Original issue reported on code.google.com by richard.eckart
on 25 Mar 2011 at 2:52
Need information about the database connection a CAS was generated from.
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:21
I think TokenFilter should be renamed to AnnotationByLengthFilter and be changed to work on any kind of annotation instead of just tokens. It should probably even accept a list of types, with Token as the default.
Original issue reported on code.google.com by richard.eckart
on 7 May 2011 at 10:43
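The filtering logic itself is simple; a plain-Java sketch of the length check (operating on covered text strings here as a stand-in for removing out-of-range annotations from the CAS, names hypothetical):

```java
import java.util.List;
import java.util.stream.Collectors;

public class LengthFilterSketch {
    // Keep only items whose text length lies within [min, max].
    public static List<String> filterByLength(List<String> texts, int min, int max) {
        return texts.stream()
                .filter(t -> t.length() >= min && t.length() <= max)
                .collect(Collectors.toList());
    }
}
```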
The model is always coupled to the language code parameter. There should be
additional parameters to override the model and the model encoding for the case
that somebody wants to specify a custom model.
Original issue reported on code.google.com by richard.eckart
on 3 Jan 2011 at 12:30
When there is a very very long token in the text, the analysis engine fails.
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:14
There are no Tokens generated for "c" tags.
Original issue reported on code.google.com by richard.eckart
on 29 Jan 2012 at 1:21
The TreeTaggerPosLemma annotation always creates POS Tags and Lemmas. In the
case that a corpus is read that already provides POS tags, it would be nice to
only add the lemmas. Thus there should be switches to enable/disable the
creation of Lemma and POS annotations.
Original issue reported on code.google.com by richard.eckart
on 28 May 2011 at 2:58
So far, the Web1TFormatWriter always writes Token frequencies.
It should be possible to use different annotation types, e.g. Lemmas.
Original issue reported on code.google.com by [email protected]
on 14 Oct 2011 at 12:21
The artifactId of the new ark-tweek module does not end in "-asl".
Original issue reported on code.google.com by richard.eckart
on 3 Feb 2012 at 12:26
We have some file-based writers that all work slightly differently, in
particular these:
- XmiWriter,
- XmlWriterInline,
- TextWriter
They also all have slightly different parameter names, do not all support
compression, etc.
Original issue reported on code.google.com by richard.eckart
on 28 Jan 2012 at 7:50
Added a failing test (currently ignored) that tries to read a tiger corpus file.
This should be in Negra export format, but cannot be read with the current
version of the reader.
Original issue reported on code.google.com by [email protected]
on 29 Sep 2011 at 3:44
It would be nice to optionally specify all necessary properties, executables and resources in the parameters of the analysis engine.
Example:
The TreeTagger installation for its wrapper in DKPro is currently only added by Maven. It's not possible for other developers to include DKPro components using only the descriptors.
Original issue reported on code.google.com by [email protected]
on 1 Feb 2011 at 2:06
It would be helpful if the model and binary JARs contained Maven metadata that Artifactory could read to pre-fill the deploy form.
Original issue reported on code.google.com by richard.eckart
on 27 Jun 2011 at 9:27
I would appreciate a new boolean parameter in BreakIteratorSegmenter that controls whether punctuation marks are marked as tokens or not.
(If available, see Bug 851 in DKPro Semantics.)
Thanks in advance,
Marko
Original issue reported on code.google.com by [email protected]
on 19 May 2011 at 4:22
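Since BreakIteratorSegmenter is built on java.text.BreakIterator, the requested behaviour could look roughly like this (a sketch, not the actual component; the boolean mirrors the proposed parameter):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PunctuationTokenSketch {
    // Tokenize with java.text.BreakIterator and optionally drop
    // single-character punctuation tokens.
    public static List<String> tokenize(String text, boolean keepPunctuation) {
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ENGLISH);
        bi.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String tok = text.substring(start, end).trim();
            if (tok.isEmpty()) {
                continue; // whitespace between words
            }
            boolean isPunct = tok.length() == 1 && !Character.isLetterOrDigit(tok.charAt(0));
            if (keepPunctuation || !isPunct) {
                tokens.add(tok);
            }
        }
        return tokens;
    }
}
```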
Added a failing test that is currently ignored.
Original issue reported on code.google.com by [email protected]
on 5 Apr 2011 at 1:02
What steps will reproduce the problem?
1. Use the WikipediaQueryReader with the parameter PARAM_MIN_TOKENS or PARAM_MAX_TOKENS, respectively, using a locally running MySQL DB.
What is the expected output? What do you see instead?
Expected: only Wikipedia pages with at least MIN_TOKENS and not more than MAX_TOKENS tokens. Instead, an IndexOutOfBoundsException is thrown during a substring operation (judging by the debugger).
On what operating system?
OS X
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:52
BreakIteratorSegmenter does not produce tokens unless sentences are also enabled.
Original issue reported on code.google.com by richard.eckart
on 28 Apr 2011 at 10:58
WikipediaStandardReaderBase uses the collectionId property for the pageId and
leaves the documentId field empty. This causes problems with other components
in a pipeline which expect that documentId is always set. In general we
consider documentUri and documentId to be mandatory. baseUri and collectionId
are optional. If baseUri is present, it has to be a prefix of docUri.
E.g. TextWriter tries to use documentUri and baseUri to determine the relative
output path and file name.
Original issue reported on code.google.com by richard.eckart
on 30 Aug 2011 at 5:38
Support to write data in the RelAnnis format used by Annis2 would be nice.
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:23
Added a failing test that is currently ignored.
Original issue reported on code.google.com by [email protected]
on 5 Apr 2011 at 1:02
SegmenterBase creates annotations that are never added to the indexes and thus
wastes memory, because the necessary memory is still reserved in the CAS.
Original issue reported on code.google.com by richard.eckart
on 2 Oct 2011 at 3:10
Currently no sentence boundary markers are written which means it does not
really write the correct format.
Original issue reported on code.google.com by [email protected]
on 3 Oct 2011 at 10:28
At the moment, languages that are supported by TreeTagger but do not yet have a mapping to the DKPro type system cannot be used with the TreeTagger AE. We should add a standard mapping for unsupported languages that maps all POS tags to some general-purpose annotation (I think "O" (= Other) is currently used for non-mappable types). The original POS values can then be retrieved from the PosValue feature of the O annotations.
This should not be seen as a replacement for a language mapping, but as a workaround for new languages until a mapping to the DKPro type system has been created.
Original issue reported on code.google.com by oliver.ferschke
on 9 May 2011 at 10:12
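The proposed fallback could be a simple default in the tag-to-type lookup; a sketch (the mapping entries are illustrative, not an actual DKPro tagset mapping):

```java
import java.util.HashMap;
import java.util.Map;

public class FallbackTagMapping {
    // Known tags map to their DKPro type; anything else falls back to "O"
    // (Other). The original tag would still be kept in the PosValue feature.
    private static final Map<String, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put("NN", "N");     // illustrative entry
        MAPPING.put("VVFIN", "V");  // illustrative entry
    }

    public static String uimaTypeFor(String tag) {
        return MAPPING.getOrDefault(tag, "O");
    }
}
```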
Provide some support to write CQP indexes directly, e.g. by calling cwb-makeall
from within the writer and passing all data and configuration directly to it.
Original issue reported on code.google.com by richard.eckart
on 23 Dec 2011 at 9:29
Per default the TreeTagger wrapper should intern POS values and lemmas to save
memory. It should be an option however, as somebody may not want to incur the
additional overhead.
Original issue reported on code.google.com by richard.eckart
on 29 May 2011 at 8:53
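A sketch of the opt-out interning (names hypothetical): interned tag and lemma strings share one canonical instance, so repeated values cost no extra memory.

```java
public class InternSketch {
    // Intern repeated tag/lemma strings unless the caller opts out.
    public static String maybeIntern(String value, boolean intern) {
        return intern ? value.intern() : value;
    }
}
```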
Snowball comes with a set of standard stopword lists. By default, the tagger should detect which language a document has and use the standard list for that language. It should be possible to turn that behaviour off via a parameter. Another parameter should allow loading additional stopword lists.
Original issue reported on code.google.com by richard.eckart
on 10 Jan 2011 at 1:39
Currently Stem and Lemma are defined in the Segmentation API. Arguably, they
don't have anything to do with that API other than being used as features in
Token. The types should be moved to the LexMorph API.
Original issue reported on code.google.com by richard.eckart
on 6 Sep 2011 at 12:30
Currently DKPro TreeTagger supports auto-lookup of model files: it looks up and loads the appropriate language model automatically according to the document language. No other DKPro analysis engine (AE) possesses this ability yet.
Dive into DKPro TreeTagger and learn how it does such auto-lookup. Can this mechanism be encapsulated into an ExternalResource? The goal is to let an AE automatically gain this auto-lookup feature when such an object is passed in as the parameter for the model file location.
Furthermore, specific default paths should be configurable via property files.
Lastly, can it load concrete resources lazily, i.e. load the resource the moment it is first used? (Good starting point: ExternalResourceFactory of uimaFIT, line 220.)
For the lazy-loading resources, have a look at the class ParametrizedResource in org.uimafit.factory.ExternalResourceFactoryTest.
There is one more aspect to this issue: tags produced by TreeTagger or other analysis components do not directly correspond to UIMA types. We usually have a generic base type, e.g. POS for part-of-speech annotations, and more specific subtypes, e.g. V for verbs, N for nouns, etc. The same goes for parsers or named entity recognition. The generic model resource should also have a method getUimaType(String tag) where you pass in a tag and it returns a UIMA type to use for the annotation. See de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerTT4JBase.getTagType(DKProModel, String, TypeSystem) for how this is done in the TreeTagger component.
Original issue reported on code.google.com by richard.eckart
on 3 Oct 2011 at 7:19
Checksums in TreeTagger resource packaging ant file are outdated.
Original issue reported on code.google.com by richard.eckart
on 7 May 2011 at 8:14
Currently the Web1T writer uses the current platform encoding to write files. By default it should use UTF-8, and there should be a parameter to change the encoding if desired. For the parameter, the conventions from the api.parameter module should be used.
Original issue reported on code.google.com by richard.eckart
on 3 Oct 2011 at 6:35
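A sketch of the encoding-aware writer construction (names hypothetical; the encoding argument mirrors the proposed parameter, with null meaning the UTF-8 default):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingAwareWriter {
    // Open a writer with an explicit encoding, defaulting to UTF-8
    // instead of the platform encoding.
    public static Writer open(OutputStream out, String encoding) {
        Charset cs = encoding != null ? Charset.forName(encoding) : StandardCharsets.UTF_8;
        return new OutputStreamWriter(out, cs);
    }

    // Encode a string through the writer; useful for checking the bytes.
    public static byte[] encode(String text, String encoding) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (Writer w = open(bos, encoding)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```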
When using DKPro Core in an Eclipse WTP project, Eclipse has the bad habit of
creating META-INF/MANIFEST.MF under src/main/java - of course without license.
This causes builds in Eclipse to behave strangely as the RAT plugin is executed
as part of the build by m2eclipse and RAT fails.
We could either run the RAT plugin in another phase or add an exclude.
Original issue reported on code.google.com by richard.eckart
on 14 Jul 2011 at 1:04
Adding DocumentMetaData after text has been set means ending up with two
DocumentAnnotation instances in the CAS, one created by UIMA when
setDocumentText() is called and one created by DocumentMetaData.create(). This
should *just work* without having to think too much about it.
Original issue reported on code.google.com by richard.eckart
on 4 Jan 2012 at 9:51
The patched Snowball from Lucene has "stem" as a method on SnowballProgram, but if we have some other Snowball implementation on the classpath as well, Java might choose to use the other one. So to be safe, we should use reflection here.
Original issue reported on code.google.com by richard.eckart
on 17 Apr 2011 at 5:57
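The reflective call could look roughly like this. The FakeStemmer class below is a stand-in for Lucene's SnowballProgram so the sketch is self-contained; the actual method names on the real class are an assumption here.

```java
public class ReflectiveStemmer {
    // Illustrative stand-in for the patched SnowballProgram.
    public static class FakeStemmer {
        private String current;
        public void setCurrent(String s) { current = s; }
        public boolean stem() {
            // Toy rule: strip a trailing "s".
            if (current.endsWith("s")) {
                current = current.substring(0, current.length() - 1);
            }
            return true;
        }
        public String getCurrent() { return current; }
    }

    // Resolve stem() on the runtime class via reflection, so we call the
    // method on whichever SnowballProgram actually got loaded.
    public static String stemReflectively(Object stemmer, String word) {
        try {
            Class<?> c = stemmer.getClass();
            c.getMethod("setCurrent", String.class).invoke(stemmer, word);
            c.getMethod("stem").invoke(stemmer);
            return (String) c.getMethod("getCurrent").invoke(stemmer);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```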