Recipes for cooking with cwbtools and polmineR
The vignette on sentence annotation explains how to derive a sentence annotation from an existing part-of-speech annotation that uses the STTS tagset.
The following chunk of code shows how to use openNLP for the same purpose. It is somewhat slow at the very end, however, which is why I hesitate to integrate it into the vignette.
library(RcppCWB)
library(NLP)
library(openNLP)
# Decode the token stream of the corpus
corpus_size <- cl_attribute_size("UNGA", attribute = "word", attribute_type = "p")
cpos <- 0L:(corpus_size - 1L)
ids <- cl_cpos2id("UNGA", p_attribute = "word", cpos = cpos)
word <- cl_id2str("UNGA", p_attribute = "word", id = ids)
# Rebuild the text: no whitespace before punctuation or after the last token
whitespace_after <- c(!(word[-1L] %in% c(".", ",", ":", "!", "?", ";")), FALSE)
word_with_whitespace <- paste(word, ifelse(whitespace_after, " ", ""), sep = "")
s <- String(paste(word_with_whitespace, collapse = ""))
# Character offsets of the tokens in the reconstructed string (nchar() is vectorized)
word_length <- nchar(word)
left_offset <- c(1L, (cumsum(nchar(word_with_whitespace)) + 1L)[1L:(length(word) - 1L)])
right_offset <- left_offset + word_length - 1L
# Word annotation that uses the corpus positions as ids
word_annotation <- NLP::Annotation(
  id = cpos,
  type = rep.int("word", length(cpos)),
  start = left_offset,
  end = right_offset
)
# Detect sentences, then collect the corpus positions of the words in each sentence
sent_token_annotator <- Maxent_Sent_Token_Annotator()
sentence_annotation <- annotate(s, sent_token_annotator)
a <- c(word_annotation, sentence_annotation)
sentences_cpos <- lapply(annotations_in_spans(a[a$type == "word"], a[a$type == "sentence"]), function(a) a$id)
# One row per sentence: first and last corpus position
region_matrix <- do.call(rbind, lapply(sentences_cpos, function(cpos) c(cpos[1L], cpos[length(cpos)])))
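The offset arithmetic is easier to follow on a toy example, and the slow final step might be replaced by a vectorized lookup. The following is only a sketch in base R: the tokens and the sentence spans are made-up data, and the `findInterval()` approach assumes that sentence spans are sorted, do not overlap, and cover every token. In the recipe above, the sentence start offsets would presumably be taken from `sentence_annotation$start`.

```r
# Toy tokens standing in for the decoded "word" attribute (hypothetical data)
word <- c("Peace", "is", "possible", ".", "We", "must", "act", ",", "now", "!")
cpos <- seq_along(word) - 1L  # CWB corpus positions are zero-based

# No whitespace before punctuation, none after the last token
punctuation <- c(".", ",", ":", "!", "?", ";")
whitespace_after <- c(!(word[-1L] %in% punctuation), FALSE)
word_with_whitespace <- paste0(word, ifelse(whitespace_after, " ", ""))
s <- paste(word_with_whitespace, collapse = "")

# Character offsets of every token in the reconstructed string
left_offset <- c(1L, cumsum(nchar(word_with_whitespace))[-length(word)] + 1L)
right_offset <- left_offset + nchar(word) - 1L

# Sanity check: the offsets recover the tokens
stopifnot(identical(substring(s, left_offset, right_offset), word))

# Sentence spans as a sentence detector might return them (hypothetical values)
sentence_start <- c(1L, 20L)
sentence_end <- c(18L, 36L)

# Vectorized assignment: index of the last sentence starting at or before each token
sentence_no <- findInterval(left_offset, sentence_start)
stopifnot(all(right_offset <= sentence_end[sentence_no]))

# First and last corpus position of each sentence (tokens are in corpus order)
region_matrix <- cbind(
  cpos_left = cpos[!duplicated(sentence_no)],
  cpos_right = cpos[!duplicated(sentence_no, fromLast = TRUE)]
)
print(region_matrix)
```

Because the lookup runs over plain integer vectors, it avoids building one annotation object per sentence. Whether it is actually faster on a full corpus has not been measured here.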
While preparing a (proper) release of the GermaParl package, I removed the following function from the package. It defines something like a simple query engine. A more generic implementation (i.e. one that is not exclusively focused on GermaParl) is easy to conceive. However, it should be considered whether this would bloat the polmineR package.
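#' Query GermaParl
#'
#' Rank the speeches in GermaParl by their cosine similarity with a count object.
#'
#' @param cnt A count object; its token ids and counts are used as search vector.
#' @param p_attribute The p-attribute of the document-term matrix to load.
#' @param min_size Minimum number of tokens a speech needs to be considered.
#' @import Matrix
#' @importFrom stats setNames
#' @importFrom slam row_sums
#' @importFrom polmineR as.sparseMatrix weigh
#' @examples
#' \dontrun{
#' P <- partition("GERMAPARL", cap = "^.*\\|8-01\\|.*$", regex = TRUE)
#' C <- count(P, p_attribute = c("word", "pos"))
#' CNT <- as(C, "count")
#' matches <- germaparl_search_speeches(cnt = CNT, p_attribute = "word", min_size = 250)
#' PB <- partition_bundle("GERMAPARL", s_attribute = "speech", values = names(matches)[1:20])
#' }
#' @export germaparl_search_speeches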
germaparl_search_speeches <- function(cnt, p_attribute, min_size = 250){
  if (requireNamespace("qlcMatrix", quietly = TRUE)){
    # Load the precomputed document-term matrix shipped with the package (~ 3 secs)
    dtm_file <- system.file(package = "GermaParl", "extdata", "dtm", sprintf("dtm_%s.rds", p_attribute))
    dtm <- readRDS(file = dtm_file)
    # Append the query as an additional row named "search_vector"
    dtm$i <- as.integer(c(dtm$i, rep(x = nrow(dtm) + 1, times = nrow(cnt))))
    # Cell values are token counts (assumes cnt has a "count" column)
    dtm$v <- as.integer(c(dtm$v, cnt[["count"]]))
    dtm$j <- as.integer(c(dtm$j, cnt[["word_id"]]))
    dtm$nrow <- as.integer(dtm$nrow + 1L)
    dtm$dimnames[[1]] <- c(dtm$dimnames[[1]], "search_vector")
    # Drop short documents, then apply tf-idf weighting
    dtm_subset <- if (!is.null(min_size)) dtm[which(row_sums(dtm) >= min_size),] else dtm
    dtm_weighed <- weigh(dtm_subset, method = "tfidf")
    M <- t(as.sparseMatrix(dtm_weighed))
    query <- setNames(as.vector(dtm_weighed["search_vector",]), colnames(dtm_weighed))
    query <- query[order(query, decreasing = TRUE)]
    # Cosine similarity of every document (columns of M) with the search vector (last column)
    simMatrix <- qlcMatrix::cosSparse(X = M[, 1L:(ncol(M) - 1L)], Y = Matrix(as.matrix(M[,ncol(M)])))
    simVector <- setNames(simMatrix[,1], rownames(simMatrix))
    simVectorOrdered <- simVector[order(simVector, decreasing = TRUE)]
    return(simVectorOrdered)
  } else {
    stop("package 'qlcMatrix' required but not available")
  }
}
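The idea behind the function (append the query to the document-term matrix, weight everything, rank documents by cosine similarity with the query row) can be sketched on a small dense matrix in base R. The matrix, the term counts and the tf-idf variant below are illustrative assumptions. In particular, the weighting is a simple textbook formula and not necessarily what `weigh(method = "tfidf")` computes.

```r
# Hypothetical toy document-term matrix: rows are speeches, columns are terms
dtm <- matrix(
  c(4, 0, 1, 0,
    0, 3, 0, 2,
    2, 0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("speech_1", "speech_2", "speech_3"),
    c("peace", "tax", "treaty", "budget")
  )
)

# Term counts of the text used as query (hypothetical)
search_vector <- c(peace = 2, tax = 0, treaty = 1, budget = 0)

# Append the query as an additional row so it is weighted like any other document
m <- rbind(dtm, search_vector = search_vector[colnames(dtm)])

# Simple tf-idf: relative term frequency times log2(N / document frequency)
tf <- m / rowSums(m)
idf <- log2(nrow(m) / colSums(m > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between the query row and every other row
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
q <- tfidf["search_vector", ]
docs <- tfidf[rownames(tfidf) != "search_vector", , drop = FALSE]
sim <- apply(docs, 1, cosine, y = q)
sim_ordered <- sort(sim, decreasing = TRUE)
print(round(sim_ordered, 3))
```

In this toy data, `speech_2` shares no terms with the query and therefore ends up last with a similarity of zero. A sparse implementation such as the one above is preferable for a real corpus, because a dense matrix of all speeches and terms would not fit comfortably in memory.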