bignlp's Issues

"NA TOLL" error

The token "NA" - both letters in capitals - is passed as an empty string and not encoded correctly later on. This might be a cwbtools issue as well.

Setting output_format twice necessary?

It is confusing for users that the output_format is set twice.

First, in the properties:

props_file <- corenlp_get_properties_file(lang = "en", fast = TRUE)
props <- properties(x = props_file)
properties_set_output_format(props, "conll")
properties_set_threads(props, no_cores)

Second, when initializing the StanfordCoreNLP class.

At this stage, both are necessary: the properties control which output is generated when calling $process_files(), while the argument of the $initialize() method instantiates an outputter that is required to process the CoNLL (or XML, JSON) output.

This is hard for the user to understand; the initialization should therefore also set the properties.
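
For reference, these are the two places where the format is currently set; both calls appear verbatim elsewhere in these issues, this is just a side-by-side sketch:

# output format is set once in the properties ...
properties_set_output_format(props, "conll")
# ... and a second time when the outputter is instantiated:
Pipe <- StanfordCoreNLP$new(properties = props, output_format = "conll")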

AnnotationList$as.data.table() results in Illegal reflective access warning

When calling AnnotationList$as.data.table() you may (almost certainly) see the following warning:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by RJavaTools to method java.util.Arrays$ArrayList.size()
WARNING: Please consider reporting this to the maintainers of RJavaTools

To reproduce this, use the following example for the AnnotationList class:

library(bignlp)
docs <- c("This is a first document.", "This is another document!")
annoli <- AnnotationList$new(docs)
Pipe <- AnnotationPipeline$new()
Pipe$annotate(annoli)
annoli$as.data.table()

The warning is issued the first time you access the enriched ArrayList with the annotations, see the following minimal example.

library(bignlp)
annoli <- AnnotationList$new(c("This is a first document.", "This is another document!"))
AnnotationPipeline$new()$annotate(annoli)
annoli$obj$size()

The warning comes from Java and is a frequent matter of discussion. As I understand it, the module system introduced with Java 9 is less permissive about accessing classes/methods across modules if they are not exposed/imported properly. You can control the behaviour of reflective access using Java parameters, but you cannot turn off the warning. No matter which setting you choose for "--illegal-access" (for the available options, see this), the warning is issued at least once:

options(java.parameters = c("-Xmx4g", "--illegal-access=permit"))
library(bignlp)
annoli <- AnnotationList$new(c("This is a first document.", "This is another document!"))
AnnotationPipeline$new()$annotate(annoli)
annoli$obj$size()

Since we know that this warning does not necessarily indicate a serious problem, we might consider suppressing it. But my experiments with the usual R functions such as capture.output(), sink() and suppressWarnings() have not been successful.

So at this stage, it seems to me that we have to live with this warning and may simply have to explain in the documentation that users will see it and that it can be safely ignored.

To conclude, there are two things I have not tried yet:

  • In my experiments, I used OpenJDK, not the Java version recommended by the CoreNLP team. Would using the "proper" Java version make a difference?
  • Is there a way to open/import the module (is it java.util.Arrays, i.e. the java.base module?) such that the access is not illegal any more? See the sketch below.
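
A possible, untested direction for the second point, assuming the relevant package is java.util in the java.base module: --add-opens grants reflective access explicitly, which might avoid the warning without relying on --illegal-access.

# Untested sketch: open java.util (java.base module) to unnamed modules before
# the JVM is initialized; this is an assumption, not a verified fix.
options(java.parameters = c("-Xmx4g", "--add-opens=java.base/java.util=ALL-UNNAMED"))
library(bignlp)
annoli <- AnnotationList$new(c("This is a first document.", "This is another document!"))
AnnotationPipeline$new()$annotate(annoli)
annoli$obj$size()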

Input in AnnotationList in dev branch

I have three rather small observations concerning the new promising looking annotation workflow:

  • Line 31 indicates an object called "s" but there is none: it should be x, shouldn't it?

} else if (is(s) == "jobjRef"){

This will produce an error if the object is neither a character vector nor a list (see the sketch after these remarks).

  • Line 34 starts an empty else. Currently, the return value of a missing input is essentially a NULL object. If that's intended, then the current structure might be fine.

} else {

  • On a really minor note concerning the same class, there is a typo in the documentation:

#' @description Initialize AnnotationPipeline

This should read AnnotationList, shouldn't it?
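
Putting the first two observations together, a minimal hypothetical sketch of the input dispatch (this is not the package code; the helper name and error message are made up for illustration):

# Hypothetical helper, not bignlp code: dispatch on the input type and fail
# loudly on unsupported input instead of silently returning NULL.
check_input <- function(x) {
  if (is.character(x)) {
    as.list(x)
  } else if (is.list(x)) {
    x
  } else if (methods::is(x, "jobjRef")) {
    x
  } else {
    stop("input must be a character vector, a list, or a jobjRef object")
  }
}
check_input(c("This is a first document.", "This is another document!"))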

corenlp_annotate on logging branch without (or with irritating) progress indication

If I understand correctly, you implemented a more robust parallelization workflow on the current "logging" branch by actively triggering the Java garbage collection. At first glance, this seems to effectively limit memory usage per core, which is good (no guarantees just yet, as the data I used wasn't that large). However, I noticed that the actual annotation step, when used as follows in the console,

ndjson_files <- corenlp_annotate(
  input = tsv_files,
  output = ndjson_files,
  threads = no_cores,
  byline = TRUE,
  method = "json",
  progress = FALSE
)

looks rather scary because the terminal "scrolls" through an infinite number of invisible lines while calculating, making it look as if the process has crashed until the console prompt comes back after a while.

I didn't test progress = TRUE because this was a tricky option to use in earlier versions. But I think that with progress = FALSE, this console output should be suppressed, because it is rather irritating.

superfluous quotation marks | remove for parallelized option as well

With c1c422b you fixed an issue concerning superfluous quotation marks. However, this only fixes half of the issue: when using more than one core, it still persists. The same fix must be applied in the else if (threads >= 2L) part of corenlp_annotate as well.

if (grepl('^".*"$', chunk_data[["text"]])) chunk_data[["text"]] <- gsub('^"(.*)"$', "\\1", chunk_data[["text"]])

Show Java encoding upon initialization

The package assumes that Java will work with UTF-8, but this is not necessarily the case. Check it upon attaching bignlp using this snippet:

library(rJava)
.jcall("java/lang/System", "S", "getProperty", "file.encoding")

Use Java Parallelization

library(cwbtools)
library(data.table)
library(rJava)

.jinit()
.jaddClassPath(Sys.glob("/opt/stanford-corenlp/stanford-corenlp-full-2018-10-05/*.jar"))

properties <- .jnew("java.util.Properties")
properties$put("annotators", "tokenize, ssplit, pos, lemma")

corenlp <- .jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", properties)

unga_xml_files <- list.files(system.file(package = "bignlp", "extdata", "xml"), full.names = TRUE)
CD <- CorpusData$new()
CD$import_xml(filenames = unga_xml_files)

array_list <- .jnew("java.util.ArrayList")
dummy <- sapply(CD$chunktable[["text"]], function(x) array_list$add(.jnew("edu.stanford.nlp.pipeline.Annotation", .jnew("java.lang.String", x))))

corenlp$annotate(array_list, 7L) # annotate(Iterable<Annotation>, int numThreads)

json_outputter <- rJava::.jnew("edu.stanford.nlp.pipeline.JSONOutputter")
ndjson <- lapply(0L:(array_list$size() - 1L), function(i) json_outputter$print(array_list$get(i)))
y <- lapply(ndjson, jsonlite::fromJSON)

Vignette | Minor Remarks

These are small things I noticed and which do not warrant an issue on their own.

library(bignlp)

library(bignlp) is called twice, which isn't harmful but unnecessary (lines 67 and 85).

props in Workflow 3

You present three different workflows for different use cases. All three use "props" when initialized. In the first two, the properties are explicitly loaded from the package and configured depending on the workflow. The third workflow uses props without saying where it comes from. While it is possible to conclude from the StanfordCoreNLP class that props should be loaded and set up as in the first workflow (including properties_set_threads(props, no_cores)), it might be useful to be explicit here.

Explaining annotate

I think it might be interesting to learn about the threads argument in Pipe$annotate(alist) and why it isn't passed explicitly in line 232. The documentation of the annotate method says:

threads If NULL, all available threads are used, otherwise an integer value with number of threads to use.

This is true, it will use all threads by default (depending on the kind of annotators you use). But in my configuration, even when setting it to 2 via Pipe$annotate(annoli, 2L), the process still seems to use more cores at times. I assume that different annotators in the pipeline use different numbers of threads. So maybe the "threads" argument does not control that globally?

If this is a substantial issue, I can also submit it accordingly.

propslist (lines 274ff)

I assume this is just an example, but it shouldn't work because the German models don't include lemma.

German Properties file

At line 443 you discuss the German properties file. In the file itself, it might be useful to either add the annotators for "pos" and "ner" or to mention that they are turned off by the file provided with the package and can be added manually.

`corenlp_parse_conll()` throws warning

This is due to a call to file.exists() that checks whether the incoming object is a file. If the incoming object is a CoNLL string rather than a path, this may result in a warning:

conll_str <- "1\t„\t_\tPUNCT\tO\t_\t_\n2\tDie\t_\tDET\tO\t_\t_\n3\tweiteren\t_\tADJ\tO\t_\t_\n4\tbürokratischen\t_\tADJ\tO\t_\t_\n5\tHürden\t_\tNOUN\tO\t_\t_\n6\t,\t_\tPUNCT\tO\t_\t_\n7\tdie\t_\tPRON\tO\t_\t_\n8\tder\t_\tDET\tO\t_\t_\n9\tGesetzesentwurf\t_\tNOUN\tO\t_\t_\n10\tdes\t_\tDET\tO\t_\t_\n11\tBundesinnenministers\t_\tNOUN\tO\t_\t_\n12\tmit\t_\tADP\tO\t_\t_\n13\tsich\t_\tPRON\tO\t_\t_\n14\tbringen\t_\tVERB\tO\t_\t_\n15\twürde\t_\tAUX\tO\t_\t_\n16\t,\t_\tPUNCT\tO\t_\t_\n17\twären\t_\tAUX\tO\t_\t_\n18\tfür\t_\tADP\tO\t_\t_\n19\tdie\t_\tDET\tO\t_\t_\n20\tbetroffenen\t_\tADJ\tO\t_\t_\n21\tJugendlichen\t_\tNOUN\tO\t_\t_\n22\tunzumutbar\t_\tADJ\tO\t_\t_\n23\t\"\t_\tPUNCT\tO\t_\t_\n24\t,\t_\tPUNCT\tO\t_\t_\n25\tso\t_\tADV\tO\t_\t_\n26\tKolat\t_\tPROPN\tPERSON\t_\t_\n27\tweiter\t_\tADJ\tO\t_\t_\n28\t.\t_\tPUNCT\tO\t_\t_\n\n1\tDer\t_\tDET\tO\t_\t_\n2\tEntwurf\t_\tNOUN\tO\t_\t_\n3\tdes\t_\tDET\tO\t_\t_\n4\tBundesinnenministers\t_\tNOUN\tO\t_\t_\n5\tsei\t_\tAUX\tO\t_\t_\n6\tnach\t_\tADP\tO\t_\t_\n7\tMeinung\t_\tNOUN\tO\t_\t_\n8\tvieler\t_\tPRON\tO\t_\t_\n9\tJuristen\t_\tNOUN\tO\t_\t_\n10\tauch\t_\tADV\tO\t_\t_\n11\teuroparechtswidrig\t_\tADJ\tO\t_\t_\n12\tund\t_\tCCONJ\tO\t_\t_\n13\terhöhe\t_\tADJ\tO\t_\t_\n14\tsogar\t_\tADV\tO\t_\t_\n15\tden\t_\tDET\tO\t_\t_\n16\tVerwaltungsaufwand\t_\tNOUN\tO\t_\t_\n17\t.\t_\tPUNCT\tO\t_\t_\n\n"

file.exists(conll_str)

Solution: check for the file ending *.conll instead.
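
A hedged sketch of what such a check could look like (the helper name is made up; the intent is only to treat the input as a filename when it plausibly is one):

# Hypothetical helper: only treat single strings without newlines that end in
# ".conll" as filenames; everything else is parsed as raw CoNLL text.
is_conll_file <- function(x) {
  length(x) == 1L && !grepl("\n", x, fixed = TRUE) && grepl("\\.conll$", x)
}
is_conll_file(conll_str)        # FALSE - parse as string
is_conll_file("output.conll")   # TRUE - read from file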

Different results of byline and in-memory processing

This is an example from Christoph Leonhardt showing that the two methods described in the (new) package vignette may yield different results.

options(java.parameters = "-Xmx4g")

library(bignlp)
library(data.table)
library(magrittr)

sample_dt <- data.table(
  id = 1:2,
  text = c(
    "Vielen Dank. – Herr Präsident! Meine Damen und Herren! Am 25. März haben wir hier das COVID‑19-Insolvenzaussetzungsgesetz beschlossen, um die gesetzliche Pflicht zur Stellung eines Insolvenzantrags für Unternehmen, die durch die staatlichen Coronamaßnahmen zahlungsunfähig oder überschuldet werden, für sechs Monate auszusetzen. Wir als AfD-Fraktion haben diesem Gesetz damals zugestimmt. Die Begründung dafür ist auch rückblickend richtig – Zitat –:",
    "Vielen Dank. – Herr Präsident! Meine Damen und Herren! Am 25. März haben wir hier das COVID‑19-Insolvenzaussetzungsgesetz beschlossen, um die gesetzliche Pflicht zur Stellung eines Insolvenzantrags für Unternehmen, die durch die staatlichen Coronamaßnahmen zahlungsunfähig oder überschuldet werden, für sechs Monate auszusetzen. Wir als AfD-Fraktion haben diesem Gesetz damals zugestimmt. Die Begründung dafür ist auch rückblickend richtig – Zitat –:")
)

First the approach using temporary files.

props_file <- corenlp_get_properties_file(lang = "de", fast = TRUE)
props <- properties(x = props_file) %>% 
  properties_set_output_format("conll") %>%
  properties_set_threads(parallel::detectCores() - 1L)

Pipe <- StanfordCoreNLP$new(properties = props, output_format = "conll")
Pipe$verbose(FALSE)

segdirs <- segment(x = sample_dt, dir = (nlpdir <- tempdir()), chunksize = 1)
conll_files <- lapply(segdirs, Pipe$process_files)
Sys.sleep(0.5) # Java may still be working while R is moving on - then files are missing
df1 <- rbindlist(lapply(unlist(conll_files), fread, blank.lines.skip = TRUE, quote = "", header = FALSE))

Now using the in-memory approach ...

df2 <- corenlp_annotate(
  sample_dt,
  properties = properties(props_file),
  progress = FALSE
)

But now df2 is shorter - see below where the token streams differ.

df3 <- cbind(df1[, 2], df2[, 3])
DT::datatable(df3[20:24,], colnames = c("split-apply-combine", "Multithreading without temporary files"))

Suggestion | corenlp.R install French

Not sure if it makes sense to hard-code this information in the R code, but if you do it for German and English, one could easily do it for French as well. This functionality would be provided in R/corenlp.R.

fr = function(){
  message("... installing model files for: French")
  french_jar_url <- "http://nlp.stanford.edu/software/stanford-french-corenlp-2018-10-05-models.jar"
  french_jar <- file.path(corenlp_dir, "stanford-corenlp-full-2017-06-09", basename(french_jar_url))
  download.file(url = french_jar_url, destfile = french_jar)
  unzip(french_jar, files = "StanfordCoreNLP-french.properties")
  zip(zipfile = french_jar, files = "StanfordCoreNLP-french.properties", flags = "-d")
}

corenlp_parse_conll | Hashtags in input cause trouble in read.table

Problem

If there is a literal hashtag ("#") in the input of corenlp_parse_conll(), read.table() will treat the rest of the line as a comment.

read.table(text = x, blank.lines.skip = TRUE, header = FALSE, sep = "\t", quote = "")

Example

Let the sentence "Some unannotated example text with a hashtag #HashtagExample" be the input for corenlp_parse_conll(). If it were actually generated by bignlp, it would look like this:

x <- "1\tSome\t_\t_\t_\t_\t_\n2\tunannotated\t_\t_\t_\t_\t_\n3\texample\t_\t_\t_\t_\t_\n4\ttext\t_\t_\t_\t_\t_\n5\twith\t_\t_\t_\t_\t_\n6\ta\t_\t_\t_\t_\t_\n7\thashtag\t_\t_\t_\t_\t_\n8\t#HashtagExample\t_\t_\t_\t_\t_\n\n"

This causes trouble because the number of columns gets mixed up: "#HashtagExample" is treated as the start of a comment instead of as an ordinary character value.

dt <- as.data.table(
  read.table(text = x, blank.lines.skip = TRUE, header = FALSE, sep = "\t", quote = "")
)

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 8 did not have 7 elements

Potential Solution

I guess the solution is to turn off the comment character altogether in read.table():

dt <- as.data.table(
  read.table(text = x, blank.lines.skip = TRUE, header = FALSE, sep = "\t", quote = "", comment.char = "")
)

Stanford NLP installation paths lead nowhere in versions older than 4.0.0

For bignlp, Stanford CoreNLP and its language models have to be installed; bignlp does that for you. However, the download locations on the Stanford website which are hard-coded in bignlp's download function have been changed or removed for CoreNLP versions older than the new 4.0.0. The Stanford NLP website still contains the old links, which lead nowhere.

This might be a temporary problem, but for now paths should be changed in corenlp.R.

Edit: It is also necessary to change the paths in the property file.

Effect of 'prettyPrint'

There is a property 'prettyPrint': what is the effect of setting it to true/false on the JSON output? Will line breaks be omitted, reducing the file size?

Minor remarks for the documentation of corenlp_annotate()

The documentation of corenlp_annotate() includes two consecutive calls of the function:

bignlp/R/corenlp.R

Lines 52 to 65 in e6a6bda

#' y <- corenlp_annotate(
#'   x = reuters_dt,
#'   pipe = props,
#'   corenlp_dir = corenlp_get_jar_dir(),
#'   progress = FALSE
#' )
#'
#' y <- corenlp_annotate(
#'   x = reuters_dt,
#'   pipe = props,
#'   threads = TRUE,
#'   corenlp_dir = corenlp_get_jar_dir(),
#'   progress = FALSE
#' )

Two remarks here:

  • I think that the second call is not up to date: shouldn't the argument threads be an integer instead of a boolean? That is what the rest of the documentation explicitly states.
  • I am not sure how these two calls of the same function would differ. If the second call should showcase the threads argument, then this could be explained.

Another minor remark: in the documentation for "Scenario 2b", it is necessary to instantiate the Pipe object. This is done in "Scenario 2a", but if the examples should be runnable independently, the object is missing. It might be redundant, but I think it would be clearer if the Pipe object were instantiated in "Scenario 2b" as well.
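
A brief sketch of what the missing setup could look like, mirroring the instantiation used elsewhere in these issues (the properties file and output format are placeholders, not necessarily what the vignette uses):

# Assumed setup mirroring Scenario 2a; adjust properties file and output format.
props_file <- corenlp_get_properties_file(lang = "en", fast = TRUE)
props <- properties(x = props_file)
Pipe <- StanfordCoreNLP$new(properties = props, output_format = "conll")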

corenlp_annotate() | creating new nodes removes potential attributes

Problem

Using corenlp_annotate() in bignlp version 0.1.3.9002, it is necessary to identify the nodes in which the actual textual content can be found. This is done via the argument xpath, which defaults to "//p". The text of these nodes is retrieved, passed to the annotation pipeline and, finally, the name of the initial node is added to the annotated data.

A remaining gap is that, as a consequence, only the name of the node queried by the xpath and its text are kept, while potential attributes are dropped silently.

I think that the attributes should be added back to the new nodes.

Possible Solution

One reasonable solution might be to add

new_nodes <- xml_find_all(xml_doc_tmp, xpath = xpath)
xml2::xml_attrs(new_nodes) <- sapply(text_nodes, xml_attrs)

after this following existing chunk:

bignlp/R/corenlp.R

Lines 338 to 343 in e6a6bda

xml_doc_tmp <- read_xml(
  x = charToRaw(enc2utf8(xml_doc_char)),
  encoding = "UTF-8",
  as_html = FALSE,
  options = opts
)

At this point, the nodes are back as XML, and adding the attributes from the original text nodes should be fast and robust, as long as the annotation pipeline indeed returns all text nodes (empty text nodes were removed earlier, so this should not be an issue) and does so in the correct order.
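
To make that order/length assumption explicit, a defensive check could precede the assignment (a sketch; text_nodes is assumed to hold the original nodes queried by the same xpath):

# Guard: both node sets must align one-to-one before copying attributes.
stopifnot(length(new_nodes) == length(text_nodes))
xml2::xml_attrs(new_nodes) <- lapply(text_nodes, xml2::xml_attrs)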

Windows CI

I temporarily took CI for Windows out of R-CMD-check.yaml by removing this line:

          - {os: windows-latest, r: 'release'}

The package is not yet ready/tested to run on Windows. CI for Windows should be activated again once we've looked into this.

Stanford CoreNLP 4.0.0 | German NER model leads to broken json

The following issue does not apply to the current default version of bignlp. The Stanford CoreNLP version downloaded there (3.9.2) still works out of the box.

However, after updating to Stanford CoreNLP 4.0.0 and using the German NER model (german.distsim.crf.ser.gz), corenlp_annotate() in the current workflow returns ndjson files which cannot be parsed by corenlp_parse_ndjson(). The culprit is the broken formatting of the nerConfidences attribute in the output.

...
"nerConfidences": { "PERSON": 0,9999 }
...

With the comma as decimal separator, this is not valid JSON - the parser would expect a quoted string value here. For now, I applied a quick fix that replaces the comma before parsing the ndjson.

x <- readLines(ndjson_files)
x2 <- gsub("\\s(\\d*),(\\d+)", " \\1\\.\\2", x)
writeLines(x2, ndjson_files)

I will leave this issue in case the default version of bignlp is updated in the future and the error resurfaces.

No Output for corenlp_parse_ndjson

tsv_files_tagged <- corenlp_parse_ndjson(
  input = ndjson_files,
  cols_to_keep = c("id", p_attrs),
  output = tsv_files_tagged,
  threads = no_cores,
  byline = TRUE,
  progress = TRUE,
  verbose = TRUE # this doesn't change anything
)

Parsing 10 ndjson files with a size of 10 gigabytes each, this code does not produce any output, neither in the console nor in RStudio. After seven minutes of running, no output file had been produced either when running in RStudio.

pad output names of segment()

I would propose a small change to the following part of the segment() function on the javamultithreading branch, because otherwise the order of the CoNLL files can easily get mixed up when they are read back in later on:

      f <- file.path(file.path(dir, i, sprintf("%d.txt", chunks[[i]][["id"]][j])))

in segment() could be changed to pad the number in the filename with zeros according to the number of characters of chunksize, such as:

      f <- file.path(file.path(dir, i, sprintf("%0*d.txt", nchar(chunksize),  chunks[[i]][["id"]][j])))

I think this should make it easier to sort by filename to ensure the correct order.

Edit: The number of characters in chunksize was chosen because it seemed that the scenario I was describing would occur mostly in the very first chunk (which is prevented by the approach above), but that doesn't need to be true. A more universal solution would be to pad all file names according to the nchar of the final ID in the input table.
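
A sketch of that more universal padding, assuming x is the input table with an integer id column (as in the chunks produced by segment()):

# Pad to the width of the largest id in the input table.
width <- nchar(max(x[["id"]]))
f <- file.path(dir, i, sprintf("%0*d.txt", width, chunks[[i]][["id"]][j]))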

Parallel processing using AnnotationPipeline$annotate()

This is just a quick proof of concept of how calling the annotate() method on iterable objects might work:

library(rJava)
library(bignlp)

S <- StanfordCoreNLP$new(
  properties = corenlp_get_properties_file(lang = "de"),
  output_format = "json"
)

anno1 <- .jnew(
  "edu/stanford/nlp/pipeline/Annotation",
  .jnew("java.lang.String", "Das ist ein Satz.")
  )
anno2 <- .jnew(
  "edu/stanford/nlp/pipeline/Annotation",
  .jnew("java.lang.String", "Das ist anderer Satz.")
)
anno3 <- .jnew(
  "edu/stanford/nlp/pipeline/Annotation",
  .jnew("java.lang.String", "Das ist anderer Satz. Und noch ein Satz")
)

arr <- .jarray(list(anno1, anno2, anno3))
a <- J("java.util.Arrays")$asList(arr) # static call - Arrays has no public constructor

# it <- .jnew("edu.stanford.nlp.util.Iterables")
# i <- it$chain(a)

props <- bignlp::properties(corenlp_get_properties_file(lang = "en", fast = TRUE))
s <- .jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)
s$annotate(a) # annotate the list "a" (the commented-out "i" is never created)

json_outputter <- .jnew("edu.stanford.nlp.pipeline.JSONOutputter")
cat(.jcall(json_outputter, "Ljava/lang/String;", "print", anno3))

# or more low-level

pl <- .jnew("edu.stanford.nlp.pipeline.AnnotationPipeline")
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.TokenizerAnnotator"))
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.WordsToSentencesAnnotator"))
pl$annotate(a)

Output files in chunk_table_split

In order to specify the output location of the chunk_table_split() function, it is necessary to provide the names and paths of n target files, which seems unnecessarily complicated. A more practical solution would be to provide an output directory in which the n target files are stored instead.

Something like this (not tested):

output <- "/hd/cl_tmp/"

  if (!is.null(output)){
    output <- file.path(
      output,
      paste(
        gsub("^(.*?)\\..*?$", "\\1", basename(input)),
        "_", 1L:n,
        gsub("^.*(\\..*?)$", "\\1", basename(input)),
        sep = ""
      )
    )

License of REUTERS data

The package now includes a minimal documentation of the REUTERS data that is included as sample data (see ?reuters_dt). As we seek to prepare the package for further publication, we should look into the licensing conditions and explain them in the documentation.

Process data that has already been tokenized

Exploring the code of CoreNLP to gain a better understanding of how to process strings passed directly from R, I realized that the AnnotationPipeline class, though more low-level than StanfordCoreNLP, allows much more control over which annotators to use. In particular, you can process Annotation class objects (in parallel), and you can be very specific about the input data and the annotators.

I am not yet able to spell out in detail how this would work, but it might be (should be) possible to infuse data such that you do not have to start the annotation from scratch.

library(rJava)
library(bignlp)

# remains unused - just ensures that CoreNLP is on classpath
foo <- StanfordCoreNLP$new(
  properties = corenlp_get_properties_file(lang = "de"),
  output_format = "json"
)

anno1 <- .jnew(
  "edu/stanford/nlp/pipeline/Annotation",
  .jnew("java/lang/String", "This is some text to annotate. And here comes another sentence!")
)

anno2 <- .jnew(
  "edu/stanford/nlp/pipeline/Annotation",
  .jnew("java/lang/String", "We want to test whether this works. Would be great.")
)

al <- .jnew("java.util.ArrayList", J("edu/stanford/nlp/pipeline/Annotation"))
al$add(anno1)
al$add(anno2)


pl <- .jnew("edu.stanford.nlp.pipeline.AnnotationPipeline")
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.TokenizerAnnotator"))
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.WordsToSentencesAnnotator"))
pl$addAnnotator(.jnew("edu.stanford.nlp.pipeline.POSTaggerAnnotator"))

pl$annotate(al)

json_outputter <- .jnew("edu.stanford.nlp.pipeline.JSONOutputter")
cat(.jcall(json_outputter, "Ljava/lang/String;", "print", al$get(0L)))
cat(.jcall(json_outputter, "Ljava/lang/String;", "print", al$get(1L)))

Memory Usage with large data

Tagging a chunkdata file of 3.5 GB with one core and the following code:

CD$tokenstream <- chunk_table_split(chunkdata_file, output = NULL, n = no_cores, verbose = TRUE) %>%
  corenlp_annotate(threads = no_cores, byline = TRUE, progress = interactive()) %>%
  corenlp_parse_ndjson(cols_to_keep = c("id", p_attrs), output = tsv_file_tagged, threads = no_cores, progress = interactive()) %>%
  lapply(fread) %>%
  rbindlist()

results in R using about 22 GB of RAM, although the JVM is initialized with a limit of 4 GB:

options(java.parameters = "-Xmx4g")

If the same operation is performed with more cores (i.e. the chunk file is split), each R process (one per core) will use around 22GB of RAM.

Edit: Just checked the package version: This problem occurs on bignlp 0.5.0. There is one newer version (0.6.0). However, I don't see how the changes would affect this behavior.

Hard coded setting of JVM heap space

The file zzz.R now includes this line:

options(java.parameters = "-Xmx4g")

It is a shortcut to pass R CMD check and to ensure that the JVM is sufficiently sized. Yet this solution / hack is certainly a violation of good practice.
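
A hedged sketch of a less intrusive alternative (the hook below is illustrative, not the current zzz.R; it assumes the only goal is a sensible default heap size):

# Hypothetical .onLoad(): only set a default heap size if the user has not
# already configured java.parameters, so user settings are never overridden.
.onLoad <- function(libname, pkgname) {
  if (is.null(getOption("java.parameters"))) {
    options(java.parameters = "-Xmx4g")
  }
}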

Use CoreNLP parallelization

props <- rJava::.jnew("java.util.Properties")
props$setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
props$setProperty("tokenize.language", "de")
props$setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/german-ud.tagger")
props$setProperty("ner.model", "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz")
props$put("threads", "6")
tagger <- rJava::.jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", props)
jsonifier <- rJava::.jnew("edu.stanford.nlp.pipeline.JSONOutputter")
system.time(anno <- rJava::.jcall(tagger, "Ledu/stanford/nlp/pipeline/Annotation;", "process", "Das ist ein Satz."))
json_string <- rJava::.jcall(jsonifier, "Ljava/lang/String;", "print", anno)

Potential integer column type of parsed CoNLL output

This is a documentation of an imminent bug fix. Annotating a corpus, I saw the error:

Error in [.data.table(x, , { :
Column 2 of result for group 4421 is type 'integer' but expecting type 'character'. Column types must be consistent for each group.

The error results from a piece of text that consists of nothing but the string "-9458". It yields the CoNLL output:

"1\t-9458\t_\tNUM\tO\t_\t_\n\n"

The chosen approach for parsing the CoNLL output uses the read.table() function, which guesses the column types. The second column in this case becomes an integer vector, causing the error later on.

Easy to fix with the argument colClasses:

dt <- as.data.table(
  read.table(
    text = x,
    blank.lines.skip = TRUE,
    header = FALSE,
    sep = "\t", quote = "", comment.char = "",
    colClasses = c("integer", "character", "character", "character", "character", "character", "character")
  )
)
