spark-corenlp's People

Contributors

bororea, mengxr, slothspot

spark-corenlp's Issues

Update to CoreNLP 3.7.0 results in test failures

When upgrading to CoreNLP 3.7.0 (the current version), several test failures occur:

[info] - natlog *** FAILED ***
[info]   Array("up", "up", "up", "up", "up", "up", "up") did not equal List("up", "down", "up", "up", "up", "up", "up") (functionsSuite.scala:21)
[info] - depparse *** FAILED ***
[info]   Array([University,2,compound,Stanford,1,1.0], [located,4,nsubjpass,University,2,1.0], [located,4,auxpass,is,3,1.0], [California,6,case,in,5,1.0], [located,4,nmod:in,California,6,1.0], [located,4,punct,.,7,1.0]) did not equal List([University,2,compound,Stanford,1,1.0], [located,4,nsubj,University,2,1.0], [located,4,cop,is,3,1.0], [California,6,case,in,5,1.0], [located,4,nmod:in,California,6,1.0], [located,4,punct,.,7,1.0]) (functionsSuite.scala:21)

Getting error Exception in thread "main" java.lang.NoSuchMethodError

Hi,

I am trying to run spark-corenlp but am getting the following error:

Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
	at com.databricks.spark.corenlp.functions$.cleanxml(functions.scala:54)

My SBT configuration looks like this:

scalaVersion := "2.11.8"

libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.11" % "2.0.0",
"org.apache.spark" % "spark-sql_2.11" % "2.0.0",
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier "models",
"databricks" % "spark-corenlp" % "0.2.0-s_2.11"
)

I am running on a standalone Spark 2.0.0 cluster (Scala 2.11.8) using:

spark-submit --jars C:\Users\raghugvt\.ivy2\cache\databricks\spark-corenlp\jars\spark-corenlp-0.1.jar,C:\Users\raghugvt\.ivy2\cache\edu.stanford.nlp\stanford-corenlp\jars\stanford-corenlp-3.6.0-models.jar,C:\Users\raghugvt\.ivy2\cache\edu.stanford.nlp\stanford-corenlp\jars\stanford-corenlp-3.6.0.jar --class SparkStanfordNLPTest --master local[2] target\scala-2.11\TestSparkCoreNLP_2.11-1.0.jar

Please help.
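A later issue in this thread notes that this error was eventually worked around by bundling everything into a single jar. A speculative sbt-assembly sketch of that approach (the plugin version and the "provided" scoping are illustrative, not the project's documented setup):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt: mark Spark itself as provided so that `sbt assembly` packages only
// spark-corenlp, CoreNLP, and the models into the fat jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models",
  "databricks" % "spark-corenlp" % "0.2.0-s_2.11"
)

The assembled jar under target/scala-2.11/ can then be passed to spark-submit instead of the individual --jars entries; depending on the dependencies, an assemblyMergeStrategy may also be needed for conflicting META-INF entries.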

Example Program Issue

Hi,
I am trying to run the example program below with Spark 1.6 and Java 1.8.0_60:
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import sqlContext.implicits._

val input = Seq(
(1, "Stanford University is located in California. It is a great university.")
).toDF("id", "text")

val output = input
.select(cleanxml('text).as('doc))
.select(explode(ssplit('doc)).as('sen))
.select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

It throws an exception when assigning the output variable; the error is: error: bad symbolic reference. A signature in functions.class refers to type UserDefinedFunction
in package org.apache.spark.sql.expressions which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling functions.class.
:36: error: org.apache.spark.sql.expressions.UserDefinedFunction does not take parameters
val output = input.select(cleanxml('text).as('doc)).select(explode(ssplit('doc)).as('sen)).select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

Can you please advise where I am making a mistake?

Is Java 8 compulsory for this project?

Hi,

Is it compulsory to use Java 8? I am getting an error like:
java.lang.UnsupportedClassVersionError:
edu/stanford/nlp/simple/Document : Unsupported major.minor version 52.0

I am not sure why I am getting this error, as the compilation and runtime environments are the same and both have Java 1.8.

Thanks,
Mahesh

java.lang.NoSuchMethodError with scala/spark 2.10

I'm getting the same error that raghugvt posted here. He solved the problem by bundling everything together in one jar; however, that's not an option for me, as I would like to use spark-corenlp in a notebook.

My build.sbt is as follows:


version := "1.0"

scalaVersion := "2.10.6"

resolvers += "Spark Packages Repository" at "https://dl.bintray.com/spark-packages/maven/"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.10" % "2.1.0",
  "com.databricks" % "spark-csv_2.10" % "1.5.0",
  "org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)

libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.7.0" withSources() withJavadoc()
libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.7.0" classifier "models"
libraryDependencies += "databricks" % "spark-corenlp" % "0.2.0-s_2.11"

I'm testing with this script:

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder().master("local")
  .appName("Spark SQL basic example")
  .config("master", "spark://myhost:7077")
  .getOrCreate()

val sqlContext = spark.sqlContext

import sqlContext.implicits._

val input = Seq(
  (1, "<xml>Stanford University is located in California. It is a great university.</xml>")
).toDF("id", "text")

val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

output.show(truncate = false)

Which results in the error:

java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
	at com.databricks.spark.corenlp.functions$.cleanxml(functions.scala:54)

What is going wrong here?
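One thing that stands out in the build above is that the Spark artifacts are the _2.10 builds while the spark-corenlp package is the 0.2.0-s_2.11 build; a runtime NoSuchMethodError on scala.reflect is a typical symptom of mixing Scala binary versions. A hedged sketch of a build that keeps everything on Scala 2.11 (versions are illustrative):

scalaVersion := "2.11.8"

resolvers += "Spark Packages Repository" at "https://dl.bintray.com/spark-packages/maven/"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version, so these resolve to the _2.11 artifacts
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql" % "2.1.0",
  "org.apache.spark" %% "spark-mllib" % "2.1.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.7.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.7.0" classifier "models",
  "databricks" % "spark-corenlp" % "0.2.0-s_2.11" // published against Scala 2.11
)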

Replacing old annotator "tokenize" language "es" with language "en"...

I have this code to run CoreNLP with the Spanish language models, using the Databricks API in Scala:

var props: Properties = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
props.setProperty("tokenize.language", "es")
props.setProperty("tokenize.verbose", "true")
props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger")
props.setProperty("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
props.setProperty("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")
val sentimentPipeline = new StanfordCoreNLP(props)
val output = df
.select(explode(ssplit('_c3)).as('sen))
.select('sen, tokenize('sen).as('words) , ner('sen).as('nerTags) )
output.show(truncate = false)

My pom.xml file looks like this:

<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.8.0</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.8.0</version>
  <classifier>models-spanish</classifier>
</dependency>
<dependency>
  <groupId>databricks</groupId>
  <artifactId>spark-corenlp</artifactId>
  <version>0.2.0-s_2.10</version>
</dependency>

I get this error:

Caused by: java.io.IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as class path, filename or URL

I saw this in my log before the error:

17/07/30 19:03:04 INFO AnnotatorPool: Replacing old annotator "tokenize" with signature [tokenize.language:es;tokenize.verbose:true;] with new annotator with signature [ssplit.isOneSentence:true;tokenize.language:en;tokenize.class:PTBTokenizer;]
17/07/30 19:03:04 INFO AnnotatorPool: Replacing old annotator "ssplit" with signature [tokenize.language:es;tokenize.verbose:true;] with new annotator with signature [ssplit.isOneSentence:true;tokenize.language:en;tokenize.class:PTBTokenizer;]

I think this is the reason for my error, because the language has been replaced "automatically"?
Thanks.
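The log lines above suggest that the wrapper's built-in UDFs construct their own Simple API documents with default (English) properties, so the separately built StanfordCoreNLP pipeline is never consulted. A speculative workaround is a custom UDF that hands the Spanish properties directly to the Simple API; tokenizeEs below is a hypothetical helper, not part of the wrapper:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.sql.functions.udf
import edu.stanford.nlp.simple.Document

// Spanish properties handed straight to the Simple API Document constructor
val spanishProps = new Properties()
spanishProps.setProperty("tokenize.language", "es")
spanishProps.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger")
spanishProps.setProperty("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")

// Hypothetical Spanish tokenizer UDF, analogous to the wrapper's tokenize()
val tokenizeEs = udf { text: String =>
  new Document(spanishProps, text).sentences().asScala.flatMap(_.words().asScala).toSeq
}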

Updating to Stanford CoreNLP models 4.0.0

Hi,
Stanford CoreNLP has been upgraded to 4.0.0, and there are some changes to the classpaths of its English models, so using 4.0.0 with spark-corenlp 0.4.0-spark2.4-scala2.11 will throw exceptions about the classpath. Specifically, "edu/stanford/nlp/models/kbp/regexner_caseless.tab" is now located at "edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab".

How to use this library in Java

Hi all,
Would you tell me how to use this library in Java? I can import it, but I do not know how to use it:
"import com.databricks.spark.corenlp.functions;"
The Maven project pom.xml:

<dependency>
  <groupId>databricks</groupId>
  <artifactId>spark-corenlp</artifactId>
  <version>0.2.0-s_2.11</version>
</dependency>

My program follows:
SparkSession spark = SparkSession.builder().getOrCreate();
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
StructField[] structFields = new StructField[]{
    new StructField("intColumn", DataTypes.IntegerType, true, Metadata.empty()),
    new StructField("stringColumn", DataTypes.StringType, true, Metadata.empty())
};

StructType structType = new StructType(structFields);

List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create(1, "Stanford University is located in California. " +
    "It is a great university."));

Dataset<Row> df = spark.createDataFrame(rows, structType);

System.out.println("Test Count = " + df.count());

df.show();

Rick

Update to CoreNLP 3.9.1

Can we update this to the latest version of CoreNLP? I'm having some trouble updating it myself.

0.3.0 doesn't work on Spark 2.3 due to protobuf version conflict

Spark depends on protobuf-java 2.5.0, but CoreNLP depends on 3.2.0. Error:

java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object; @3: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @3
    flags: { }
    locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
    stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
  Bytecode:
    0x0000000: 2a2b 1cb6 0024 b0

We need to try shading spark-corenlp dependencies, which might take some work.
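For reference, a minimal sbt-assembly shading sketch of what that could look like (the relocated package name is arbitrary):

// build.sbt, with the sbt-assembly plugin enabled: relocate the protobuf classes
// pulled in via CoreNLP so they cannot collide with Spark's protobuf-java 2.5.0
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
)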

Is it possible to specify the "language" model?

Hi all,

first of all, thank you for having made this wrapper available. Really useful.

Could you let me know whether it is possible to specify the underlying CoreNLP model (English, French, ...)?

From what I understand of your code, it won't be easy since you use the Simple CoreNLP API, but it should be possible. Any ideas or plans to extend your code with this capability?

Regards,

Grégory

Cannot use version 0.3.0 with SBT

Hi there, a Scala newbie question. I'm trying to use this package. However,

  • How should it be added to build.sbt? "com.databricks" % "spark-corpnlp" % "0.3.0-SNAPSHOT" does not work ("not found" error).
  • The README mentions that "CoreNLP jars must be added to dependencies". Could you please paste a link to the jars in question? Are these the jars for this project, or the jars for the Stanford project? Are they to be added in build.sbt, or should the files be manually copied into a specific directory?

If anyone else has got this working, please help by providing more detailed setup instructions. Thanks!
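For what it's worth, a hedged build.sbt sketch using the coordinates that other issues in this thread use (note the artifact name is spark-corenlp, and the "CoreNLP jars" are the Stanford stanford-corenlp artifact plus its models classifier, added as ordinary dependencies):

resolvers += "Spark Packages Repository" at "https://dl.bintray.com/spark-packages/maven/"

libraryDependencies ++= Seq(
  "databricks" % "spark-corenlp" % "0.2.0-s_2.11",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models"
)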

How to index Spark CoreNLP analysis?

I have been using the Stanford CoreNLP wrapper for Apache Spark to do NER analysis and found it works well. However, I want to extend the simple example to a case where I can map the analysis back to an original DataFrame id. See below; I have added two more rows to the simple example.

val input = Seq(
(1, "Apple is located in California. It is a great company."),
(2, "Google is located in California. It is a great company."),
(3, "Netflix is located in California. It is a great company.")
).toDF("id", "text")

input.show()

input: org.apache.spark.sql.DataFrame = [id: int, text: string]
+---+--------------------+
| id| text|
+---+--------------------+
| 1|Apple is loc...|
| 2|Google is lo...|
| 3|Netflix is l...|
+---+--------------------+
I can then run this DataFrame through the Spark CoreNLP wrapper to do both sentiment and NER analysis.

val output = input
.select(cleanxml('text).as('doc))
.select(explode(ssplit('doc)).as('sen))
.select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))
However, in the output below I have lost the connection back to the original DataFrame row ids.

+--------------------+--------------------+--------------------+---------+
| sen| words| nerTags|sentiment|
+--------------------+--------------------+--------------------+---------+
|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...| 2|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...| 3|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...| 3|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
+--------------------+--------------------+--------------------+---------+
Ideally, I want something like the following:

+--+---------------------+--------------------+--------------------+---------+
|id| sen| words| nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...| 2|
| 1| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...| 3|
| 2| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...| 3|
| 3| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
+--+---------------------+--------------------+--------------------+---------+
I have tried to create a UDF but am unable to make it work.
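One straightforward variant, sketched from the same wrapper functions used above, is to carry the id column through every select so it survives the explode:

val output = input
  .select('id, cleanxml('text).as('doc))
  .select('id, explode(ssplit('doc)).as('sen))
  .select('id, 'sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

Each sentence row then keeps the id of the document it was exploded from, which yields the layout shown above.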

NoSuchElementException when running on a large dataset

When I limit the dataset size to 100 it works well, but when the dataset is large it crashes the program. Here is the exception; could you please give me some suggestions?
java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:854)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.IterableLike$class.head(IterableLike.scala:107)
at scala.collection.AbstractIterable.head(Iterable.scala:54)
at com.databricks.spark.corenlp.functions$$anonfun$sentiment$1.apply(functions.scala:163)
at com.databricks.spark.corenlp.functions$$anonfun$sentiment$1.apply(functions.scala:158)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/11/14 14:32:39 ERROR DefaultWriterContainer: Task attempt attempt_201611141429_0000_m_000001_0 aborted.
16/11/14 14:32:39 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
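The frames com.databricks.spark.corenlp.functions$$anonfun$sentiment$1 and IterableLike.head suggest the sentiment UDF is taking the head of an empty sentence list, which would happen on rows whose text is empty. A hedged mitigation sketch (df and its text column are assumptions about the failing job):

import org.apache.spark.sql.functions.{col, length, trim}

// Drop empty or whitespace-only rows before applying the sentiment UDF
val nonBlank = df.filter(length(trim(col("text"))) > 0)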

Can't execute your example code using Java

Hello,

I'm doing some tests using your CoreNLP wrapper but I'm unable to execute the example code you provided:

DataFrame df = sqlContext.read().json("corenlptest.json");
CoreNLP coreNLP = new CoreNLP()
      .setInputCol("text")
      .setAnnotators(new String[]{"tokenize", "ssplit", "lemma"})
      .setFlattenNestedFields(new String[]{"sentence_token_word"})
      .setOutputCol("parsed");
DataFrame outputDF = coreNLP.transform(df);

The stack trace is:

Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
	at com.databricks.spark.corenlp.CoreNLP$.extractElementType(CoreNLP.scala:170)
	at com.databricks.spark.corenlp.CoreNLP$.com$databricks$spark$corenlp$CoreNLP$$flattenStructField(CoreNLP.scala:162)
	at com.databricks.spark.corenlp.CoreNLP$$anonfun$2.apply(CoreNLP.scala:90)
	at com.databricks.spark.corenlp.CoreNLP$$anonfun$2.apply(CoreNLP.scala:89)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at com.databricks.spark.corenlp.CoreNLP.outputSchema(CoreNLP.scala:89)
	at com.databricks.spark.corenlp.CoreNLP.transform(CoreNLP.scala:80)
	at com.test.CoreNLPSpark.main(CoreNLPSpark.java:35)

Do you have an idea what could cause this? Is my attempt wrong, or can this be considered a bug?

W.

Using from Databricks Community Edition

I'm trying to use this library from a Databricks CE notebook, by adding the Stanford NLP library as well as this library as dependencies. However, I get an error indicating that this library is not recognized.

Do you know if it is possible to consume this library using Databricks Community Edition notebook?

Protobuf dependency conflict between CoreNLP and Spark

@mengxr, it looks like Spark is stuck on protobuf-java version 2.5.0 (https://github.com/apache/spark/blob/0a38637d05d2338503ecceacfb911a6da6d49538/pom.xml#L130) while CoreNLP has charged ahead with v2.6.1.

How did you overcome this conflict?

Here's the stack trace:

java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
    at edu.stanford.nlp.pipeline.CoreNLPProtos$Token$Builder.buildPartial(CoreNLPProtos.java:12243)
    at edu.stanford.nlp.pipeline.CoreNLPProtos$Token$Builder.build(CoreNLPProtos.java:12145)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:238)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProtoBuilder(ProtobufAnnotationSerializer.java:384)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:345)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProtoBuilder(ProtobufAnnotationSerializer.java:494)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:456)
    at com.databricks.spark.corenlp.CoreNLP$$anonfun$1.apply(CoreNLP.scala:77)
    at com.databricks.spark.corenlp.CoreNLP$$anonfun$1.apply(CoreNLP.scala:73)  

By the way, it looks like CoreNLP 3.6.0 is available, but it will be released to Maven Central sometime in January.

Enable coref test on Travis

There are two issues with the coref test on Travis:

  1. It needs more than 3 GB of memory. Combined with other memory usage, it is likely to hit the memory limit (4 GB) on Travis containers (see the sketch after this issue).
  2. It takes a couple of minutes to run locally, and longer on Travis, which might exceed the default timeout (10 minutes).

The test is ignored but it would be nice to re-enable it.
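For the memory point, a speculative sbt tweak is to fork the test JVM with a larger heap; this only helps where the host has the memory to spare, so the Travis container cap would still apply:

fork in Test := true
javaOptions in Test += "-Xmx4g"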

Can't use the entitymentions annotator

Hi,
I can't use the entitymentions annotator; it causes the error edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer$LossySerializationException: Keys are not being serialized: class edu.stanford.nlp.ling.CoreAnnotations$MentionsAnnotation. Please check.

Constituency parsing?

I can see that there is a function defined for dependency parsing, depparse. However, I can't see a constituency parsing function, parse, in the list of functions. Is there any way I can get constituency parsing?
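For what it's worth, a hypothetical UDF sketch that calls the Simple API's Sentence.parse() directly; parseTree is an illustrative name and not a function the wrapper exposes:

import org.apache.spark.sql.functions.udf
import edu.stanford.nlp.simple.Sentence

// Returns the constituency parse of a single sentence as an S-expression string
val parseTree = udf { s: String => new Sentence(s).parse().toString }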

Implementation takes really long to give output

Hi, I am using this library but am getting extremely slow results. For 10k records containing some text, it has taken longer than 16 hours to process 160 tasks out of 1920 after repartitioning. I am wondering whether the named entity extraction works in parallel, or whether the executors queue one after another for named entity recognition to happen. Non-parallel Python scripts seem to work faster than this. Any suggestions or workarounds would be highly appreciated.

Java 8 Requirement

Hi there, I just wonder why this package requires Java 8. Is it a hard requirement because it uses Java 8 features? Thanks!

Failed on empty strings - sentiment analysis

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

val input = Seq(
  (1, "Stanford University is located in California. It is a great university"),
  (2, "")
).toDF("id", "text")

input.withColumn("sentiment", sentiment($"text")).show()

Error: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$sentiment$1: (string) => int)

Are there any plans to catch this case properly?
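Until the UDF guards against this itself, a hedged workaround sketch is to apply sentiment only to non-blank text:

import org.apache.spark.sql.functions.{length, lit, trim, when}

input
  .withColumn("sentiment",
    when(length(trim($"text")) > 0, sentiment($"text")).otherwise(lit(null)))
  .show()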
