spark-corenlp's People

Contributors

bororea, mengxr, slothspot

spark-corenlp's Issues

Update to CoreNLP 3.7.0 results in test failures

When upgrading to CoreNLP 3.7.0 (the current version), several test failures occur:

[info] - natlog *** FAILED ***
[info]   Array("up", "up", "up", "up", "up", "up", "up") did not equal List("up", "down", "up", "up", "up", "up", "up") (functionsSuite.scala:21)
[info] - depparse *** FAILED ***
[info]   Array([University,2,compound,Stanford,1,1.0], [located,4,nsubjpass,University,2,1.0], [located,4,auxpass,is,3,1.0], [California,6,case,in,5,1.0], [located,4,nmod:in,California,6,1.0], [located,4,punct,.,7,1.0]) did not equal List([University,2,compound,Stanford,1,1.0], [located,4,nsubj,University,2,1.0], [located,4,cop,is,3,1.0], [California,6,case,in,5,1.0], [located,4,nmod:in,California,6,1.0], [located,4,punct,.,7,1.0]) (functionsSuite.scala:21)

Getting error Exception in thread "main" java.lang.NoSuchMethodError

Hi,

I am trying to run spark-corenlp but am getting the following error:

Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
	at com.databricks.spark.corenlp.functions$.cleanxml(functions.scala:54)

My SBT configuration looks like this:

scalaVersion := "2.11.8"

libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.11" % "2.0.0",
"org.apache.spark" % "spark-sql_2.11" % "2.0.0",
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier "models",
"databricks" % "spark-corenlp" % "0.2.0-s_2.11"
)

I am running on a standalone Spark 2.0.0 cluster (Scala 2.11.8) using:

spark-submit --jars C:\Users\raghugvt\.ivy2\cache\databricks\spark-corenlp\jars\spark-corenlp-0.1.jar,C:\Users\raghugvt\.ivy2\cache\edu.stanford.nlp\stanford-corenlp\jars\stanford-corenlp-3.6.0-models.jar,C:\Users\raghugvt\.ivy2\cache\edu.stanford.nlp\stanford-corenlp\jars\stanford-corenlp-3.6.0.jar --class SparkStanfordNLPTest --master local[2] target\scala-2.11\TestSparkCoreNLP_2.11-1.0.jar

Please help.
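A later issue in this thread notes that this error was eventually worked around by bundling everything into a single jar. A speculative sbt-assembly sketch of that approach (the plugin version and the "provided" scoping are illustrative, not the project's documented setup):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt: mark Spark itself as provided so that `sbt assembly` packages only
// spark-corenlp, CoreNLP, and the models into the fat jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models",
  "databricks" % "spark-corenlp" % "0.2.0-s_2.11"
)

The assembled jar under target/scala-2.11/ can then be passed to spark-submit instead of the individual --jars entries; depending on the dependencies, an assemblyMergeStrategy may also be needed for conflicting META-INF entries.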

Example Program Issue

Hi,
I am trying to run the example program below with Spark 1.6 and Java 1.8.0_60:
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import sqlContext.implicits._

val input = Seq(
(1, "Stanford University is located in California. It is a great university.")
).toDF("id", "text")

val output = input
.select(cleanxml('text).as('doc))
.select(explode(ssplit('doc)).as('sen))
.select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

It throws an exception when assigning the output variable; the error is: error: bad symbolic reference. A signature in functions.class refers to type UserDefinedFunction
in package org.apache.spark.sql.expressions which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling functions.class.
:36: error: org.apache.spark.sql.expressions.UserDefinedFunction does not take parameters
val output = input.select(cleanxml('text).as('doc)).select(explode(ssplit('doc)).as('sen)).select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

Can you please advise where I am making a mistake?

Is Java 8 compulsory for this project?

Hi,

Is it compulsory to use Java 8? I am getting an error like:
java.lang.UnsupportedClassVersionError:
edu/stanford/nlp/simple/Document : Unsupported major.minor version 52.0

I am not sure why I am getting this error, as the compilation and runtime environments are the same and both have Java 1.8.

Thanks,
Mahesh

java.lang.NoSuchMethodError with scala/spark 2.10

I'm getting the same error that raghugvt posted here. He solved the problem by bundling everything together in one jar; however, that's not an option for me, as I would like to use spark-corenlp in a notebook.

My build.sbt is as follows:


version := "1.0"

scalaVersion := "2.10.6"

resolvers += "Spark Packages Repository" at "https://dl.bintray.com/spark-packages/maven/"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.10" % "2.1.0",
  "com.databricks" % "spark-csv_2.10" % "1.5.0",
  "org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)

libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.7.0" withSources() withJavadoc()
libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.7.0" classifier "models"
libraryDependencies += "databricks" % "spark-corenlp" % "0.2.0-s_2.11"

I'm testing with this script:

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder().master("local")
  .appName("Spark SQL basic example")
  .config("master", "spark://myhost:7077")
  .getOrCreate()

val sqlContext = spark.sqlContext

import sqlContext.implicits._

val input = Seq(
  (1, "<xml>Stanford University is located in California. It is a great university.</xml>")
).toDF("id", "text")

val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

output.show(truncate = false)

Which results in the error:

java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
	at com.databricks.spark.corenlp.functions$.cleanxml(functions.scala:54)

What is going wrong here?
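One thing that stands out in the build above is that the Spark artifacts are the _2.10 builds while the spark-corenlp package is the 0.2.0-s_2.11 build; a runtime NoSuchMethodError on scala.reflect is a typical symptom of mixing Scala binary versions. A hedged sketch of a build that keeps everything on Scala 2.11 (versions are illustrative):

scalaVersion := "2.11.8"

resolvers += "Spark Packages Repository" at "https://dl.bintray.com/spark-packages/maven/"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version, so these resolve to the _2.11 artifacts
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql" % "2.1.0",
  "org.apache.spark" %% "spark-mllib" % "2.1.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.7.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.7.0" classifier "models",
  "databricks" % "spark-corenlp" % "0.2.0-s_2.11" // published against Scala 2.11
)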

Replacing old annotator "tokenize" language "es" with language "en"...

I have this code to run CoreNLP with the Spanish language models, using the Databricks API in Scala:

var props: Properties = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
props.setProperty("tokenize.language", "es")
props.setProperty("tokenize.verbose", "true")
props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger")
props.setProperty("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
props.setProperty("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")
val sentimentPipeline = new StanfordCoreNLP(props)
val output = df
.select(explode(ssplit('_c3)).as('sen))
.select('sen, tokenize('sen).as('words) , ner('sen).as('nerTags) )
output.show(truncate = false)

My pom.xml file looks like this:

<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.8.0</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.8.0</version>
  <classifier>models-spanish</classifier>
</dependency>
<dependency>
  <groupId>databricks</groupId>
  <artifactId>spark-corenlp</artifactId>
  <version>0.2.0-s_2.10</version>
</dependency>

I get this error:

Caused by: java.io.IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as class path, filename or URL

I saw this in my log before the error:

17/07/30 19:03:04 INFO AnnotatorPool: Replacing old annotator "tokenize" with signature [tokenize.language:es;tokenize.verbose:true;] with new annotator with signature [ssplit.isOneSentence:true;tokenize.language:en;tokenize.class:PTBTokenizer;]
17/07/30 19:03:04 INFO AnnotatorPool: Replacing old annotator "ssplit" with signature [tokenize.language:es;tokenize.verbose:true;] with new annotator with signature [ssplit.isOneSentence:true;tokenize.language:en;tokenize.class:PTBTokenizer;]

I think this is the reason for my error, because the language has been replaced "automatically"?
Thanks.
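The log lines above suggest that the wrapper's built-in UDFs construct their own Simple API documents with default (English) properties, so the separately built StanfordCoreNLP pipeline is never consulted. A speculative workaround is a custom UDF that hands the Spanish properties directly to the Simple API; tokenizeEs below is a hypothetical helper, not part of the wrapper:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.sql.functions.udf
import edu.stanford.nlp.simple.Document

// Spanish properties handed straight to the Simple API Document constructor
val spanishProps = new Properties()
spanishProps.setProperty("tokenize.language", "es")
spanishProps.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger")
spanishProps.setProperty("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")

// Hypothetical Spanish tokenizer UDF, analogous to the wrapper's tokenize()
val tokenizeEs = udf { text: String =>
  new Document(spanishProps, text).sentences().asScala.flatMap(_.words().asScala).toSeq
}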

Updating to Stanford CoreNLP models 4.0.0

Hi,
Stanford CoreNLP has been upgraded to 4.0.0, and there are some changes to the classpaths of its English models, so using 4.0.0 with spark-corenlp 0.4.0-spark2.4-scala2.11 will throw exceptions about the classpath. Specifically, "edu/stanford/nlp/models/kbp/regexner_caseless.tab" is now located at "edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab".

How to use this library in Java

Hi all,
Would you tell me how to use this library in Java? I can import it, but I do not know how to use it:
"import com.databricks.spark.corenlp.functions;"
The Maven project pom.xml:

<dependency>
  <groupId>databricks</groupId>
  <artifactId>spark-corenlp</artifactId>
  <version>0.2.0-s_2.11</version>
</dependency>

My program follows:
SparkSession spark = SparkSession.builder().getOrCreate();
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
StructField[] structFields = new StructField[]{
    new StructField("intColumn", DataTypes.IntegerType, true, Metadata.empty()),
    new StructField("stringColumn", DataTypes.StringType, true, Metadata.empty())
};

StructType structType = new StructType(structFields);

List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create(1, "Stanford University is located in California. " +
    "It is a great university."));

Dataset<Row> df = spark.createDataFrame(rows, structType);

System.out.println("Test Count = " + df.count());

df.show();

Rick

Update to CoreNLP 3.9.1

Can we update this to the latest version of CoreNLP? I'm having some trouble updating it myself.

0.3.0 doesn't work on Spark 2.3 due to protobuf version conflict

Spark depends on protobuf-java 2.5.0, but CoreNLP depends on 3.2.0. Error:

java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object; @3: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @3
    flags: { }
    locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
    stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
  Bytecode:
    0x0000000: 2a2b 1cb6 0024 b0

We need to try shading spark-corenlp dependencies, which might take some work.
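For reference, a minimal sbt-assembly shading sketch of what that could look like (the relocated package name is arbitrary):

// build.sbt, with the sbt-assembly plugin enabled: relocate the protobuf classes
// pulled in via CoreNLP so they cannot collide with Spark's protobuf-java 2.5.0
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
)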

Is it possible to specify the "language" model?

Hi all,

first of all, thank you for having made this wrapper available. Really useful.

Could you let me know whether it is possible to specify the underlying CoreNLP model (English, French, ...)?

From what I understand of your code, it won't be easy since you use the Simple CoreNLP API, but it should be possible. Any ideas or plans to extend your code with this capability?

Regards,

Grégory

Cannot use version 0.3.0 with SBT

Hi there, a Scala newbie question. I'm trying to use this package. However,

  • How should it be added to build.sbt? "com.databricks" % "spark-corpnlp" % "0.3.0-SNAPSHOT" does not work ("not found" error).
  • The README mentions that "CoreNLP jars must be added to dependencies". Could you please paste a link to the jars in question? Are these the jars for this project, or the jars for the Stanford project? Are they to be added in build.sbt, or should the files be manually copied into a specific directory?

If anyone else has got this working, please help by providing more detailed setup instructions. Thanks!
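For what it's worth, a hedged build.sbt sketch using the coordinates that other issues in this thread use (note the artifact name is spark-corenlp, and the "CoreNLP jars" are the Stanford stanford-corenlp artifact plus its models classifier, added as ordinary dependencies):

resolvers += "Spark Packages Repository" at "https://dl.bintray.com/spark-packages/maven/"

libraryDependencies ++= Seq(
  "databricks" % "spark-corenlp" % "0.2.0-s_2.11",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models"
)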

How to index Spark CoreNLP analysis?

I have been using the Stanford CoreNLP wrapper for Apache Spark to do NER analysis and found it works well. However, I want to extend the simple example to a case where I can map the analysis back to an original DataFrame id. See below; I have added two more rows to the simple example.

val input = Seq(
(1, "Apple is located in California. It is a great company."),
(2, "Google is located in California. It is a great company."),
(3, "Netflix is located in California. It is a great company.")
).toDF("id", "text")

input.show()

input: org.apache.spark.sql.DataFrame = [id: int, text: string]
+---+--------------------+
| id| text|
+---+--------------------+
| 1|Apple is loc...|
| 2|Google is lo...|
| 3|Netflix is l...|
+---+--------------------+
I can then run this DataFrame through the Spark CoreNLP wrapper to do both sentiment and NER analysis.

val output = input
.select(cleanxml('text).as('doc))
.select(explode(ssplit('doc)).as('sen))
.select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))
However, in the output below I have lost the connection back to the original DataFrame row ids.

+--------------------+--------------------+--------------------+---------+
| sen| words| nerTags|sentiment|
+--------------------+--------------------+--------------------+---------+
|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...| 2|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...| 3|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...| 3|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
+--------------------+--------------------+--------------------+---------+
Ideally, I want something like the following:

+--+---------------------+--------------------+--------------------+---------+
|id| sen| words| nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...| 2|
| 1| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...| 3|
| 2| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...| 3|
| 3| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
+--+---------------------+--------------------+--------------------+---------+
I have tried to create a UDF but am unable to make it work.
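One straightforward variant, sketched from the same wrapper functions used above, is to carry the id column through every select so it survives the explode:

val output = input
  .select('id, cleanxml('text).as('doc))
  .select('id, explode(ssplit('doc)).as('sen))
  .select('id, 'sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

Each sentence row then keeps the id of the document it was exploded from, which yields the layout shown above.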

NoSuchElementException when running on a large dataset

When I limit the dataset size to 100 it works well, but when the dataset is large it crashes the program. Here is the exception; could you please give me some suggestions?
java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:854)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.IterableLike$class.head(IterableLike.scala:107)
at scala.collection.AbstractIterable.head(Iterable.scala:54)
at com.databricks.spark.corenlp.functions$$anonfun$sentiment$1.apply(functions.scala:163)
at com.databricks.spark.corenlp.functions$$anonfun$sentiment$1.apply(functions.scala:158)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/11/14 14:32:39 ERROR DefaultWriterContainer: Task attempt attempt_201611141429_0000_m_000001_0 aborted.
16/11/14 14:32:39 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
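The frames com.databricks.spark.corenlp.functions$$anonfun$sentiment$1 and IterableLike.head suggest the sentiment UDF is taking the head of an empty sentence list, which would happen on rows whose text is empty. A hedged mitigation sketch (df and its text column are assumptions about the failing job):

import org.apache.spark.sql.functions.{col, length, trim}

// Drop empty or whitespace-only rows before applying the sentiment UDF
val nonBlank = df.filter(length(trim(col("text"))) > 0)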

Can't execute your example code using Java

Hello,

I'm doing some tests using your CoreNLP wrapper but I'm unable to execute the example code you provided:

DataFrame df = sqlContext.read().json("corenlptest.json");
CoreNLP coreNLP = new CoreNLP()
      .setInputCol("text")
      .setAnnotators(new String[]{"tokenize", "ssplit", "lemma"})
      .setFlattenNestedFields(new String[]{"sentence_token_word"})
      .setOutputCol("parsed");
DataFrame outputDF = coreNLP.transform(df);

The stack trace is:

Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
	at com.databricks.spark.corenlp.CoreNLP$.extractElementType(CoreNLP.scala:170)
	at com.databricks.spark.corenlp.CoreNLP$.com$databricks$spark$corenlp$CoreNLP$$flattenStructField(CoreNLP.scala:162)
	at com.databricks.spark.corenlp.CoreNLP$$anonfun$2.apply(CoreNLP.scala:90)
	at com.databricks.spark.corenlp.CoreNLP$$anonfun$2.apply(CoreNLP.scala:89)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at com.databricks.spark.corenlp.CoreNLP.outputSchema(CoreNLP.scala:89)
	at com.databricks.spark.corenlp.CoreNLP.transform(CoreNLP.scala:80)
	at com.test.CoreNLPSpark.main(CoreNLPSpark.java:35)

Do you have an idea what could cause this? Is my attempt wrong, or can this be considered a bug?

W.

Using from Databricks Community Edition

I'm trying to use this library from a Databricks CE notebook, by adding the Stanford NLP library as well as this library as dependencies. However, I get an error indicating that this library is not recognized.

Do you know if it is possible to consume this library using Databricks Community Edition notebook?

Protobuf dependency conflict between CoreNLP and Spark

@mengxr, it looks like Spark is stuck on protobuf-java version 2.5.0 (https://github.com/apache/spark/blob/0a38637d05d2338503ecceacfb911a6da6d49538/pom.xml#L130) while CoreNLP has charged ahead with v2.6.1.

How did you overcome this conflict?

Here's the stack trace:

java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
    at edu.stanford.nlp.pipeline.CoreNLPProtos$Token$Builder.buildPartial(CoreNLPProtos.java:12243)
    at edu.stanford.nlp.pipeline.CoreNLPProtos$Token$Builder.build(CoreNLPProtos.java:12145)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:238)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProtoBuilder(ProtobufAnnotationSerializer.java:384)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:345)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProtoBuilder(ProtobufAnnotationSerializer.java:494)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:456)
    at com.databricks.spark.corenlp.CoreNLP$$anonfun$1.apply(CoreNLP.scala:77)
    at com.databricks.spark.corenlp.CoreNLP$$anonfun$1.apply(CoreNLP.scala:73)  

By the way, it looks like CoreNLP 3.6.0 is available, but it will be released to Maven Central sometime in January.

Enable coref test on Travis

There are two issues with the coref test on Travis:

  1. It needs more than 3 GB of memory. Combined with other memory usage, it is likely to hit the memory limit (4 GB) on Travis containers (see the sketch after this issue).
  2. It takes a couple of minutes to run locally, and longer on Travis, which might exceed the default timeout (10 minutes).

The test is ignored but it would be nice to re-enable it.
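For the memory point, a speculative sbt tweak is to fork the test JVM with a larger heap; this only helps where the host has the memory to spare, so the Travis container cap would still apply:

fork in Test := true
javaOptions in Test += "-Xmx4g"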

Can't use the entitymentions annotator

Hi,
I can't use the entitymentions annotator; it causes the error edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer$LossySerializationException: Keys are not being serialized: class edu.stanford.nlp.ling.CoreAnnotations$MentionsAnnotation. Please check.

Constituency parsing?

I can see that there is a function defined for dependency parsing, depparse. However, I can't see a constituency parsing function, parse, in the list of functions. Is there any way I can get constituency parsing?
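For what it's worth, a hypothetical UDF sketch that calls the Simple API's Sentence.parse() directly; parseTree is an illustrative name and not a function the wrapper exposes:

import org.apache.spark.sql.functions.udf
import edu.stanford.nlp.simple.Sentence

// Returns the constituency parse of a single sentence as an S-expression string
val parseTree = udf { s: String => new Sentence(s).parse().toString }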

Implementation takes really long to give output

Hi, I am using this library but am getting extremely slow results. For 10k records containing some text, it has taken longer than 16 hours to process 160 tasks out of 1920 after repartitioning. I am wondering whether the named entity extraction works in parallel, or whether the executors queue one after another for named entity recognition to happen. Non-parallel Python scripts seem to work faster than this. Any suggestions or workarounds would be highly appreciated.

Java 8 Requirement

Hi there, I just wonder why this package requires Java 8. Is it a hard requirement because it uses Java 8 features? Thanks!

Failed on empty strings - sentiment analysis

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

val input = Seq(
  (1, "Stanford University is located in California. It is a great university"),
  (2, "")
).toDF("id", "text")

input.withColumn("sentiment", sentiment($"text")).show()

Error: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$sentiment$1: (string) => int)

Are there any plans to catch this case properly?
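Until the UDF guards against this itself, a hedged workaround sketch is to apply sentiment only to non-blank text:

import org.apache.spark.sql.functions.{length, lit, trim, when}

input
  .withColumn("sentiment",
    when(length(trim($"text")) > 0, sentiment($"text")).otherwise(lit(null)))
  .show()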
