
aas's Introduction

Advanced Analytics with Spark Source Code

Code to accompany Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills.

Advanced Analytics with Spark

3rd Edition (current)

The source to accompany the 3rd edition may be found here, in the default master branch.

2nd Edition

The source to accompany the 2nd edition may be found in the 2nd-edition branch.

1st Edition

The source to accompany the 1st edition may be found in the 1st-edition branch.

Build

Apache Maven 3.2.5+ and Java 8+ are required to build. From the root level of the project, run mvn package to compile artifacts into target/ subdirectories beneath each chapter's directory.

Running the Examples

  • Install Apache Spark for your platform, following the instructions for the latest release.
  • Build the projects according to the instructions above.
  • Launch the driver program using spark-submit:
# working directory should be your Apache Spark installation root
bin/spark-submit /path/to/code/aas/$CHAPTER/target/$CHAPTER-$VERSION-jar-with-dependencies.jar
  • Some examples might require that URI paths to the data be updated to your own HDFS or local filesystem locations.

Data Sets


aas's People

Contributors

chansonzhang, dled, jwills, laserson, srowen, sryza, sskapci, tsuyo, yu-iskw


aas's Issues

Issues in running Chapter 6 code

I have packaged chapter 6 and included the jar using spark-shell.

When I try to execute the code below without @transient,

@transient val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "")
conf.set(XmlInputFormat.END_TAG_KEY, "")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable],
classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)

I get Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration .

With @transient in place I can proceed further, but after the transformation below

val plainText = rawXmls.flatMap(wikiXmlToPlainText)

I ran plainText.count

and it gave me the error below.

java.lang.NoClassDefFoundError: com/google/common/base/Charsets
at com.cloudera.datascience.common.XmlInputFormat$XmlRecordReader.(XmlInputFormat.java:79)
at com.cloudera.datascience.common.XmlInputFormat.createRecordReader(XmlInputFormat.java:55)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)

Am I missing something here?
I am using Spark 1.2 and Hadoop 2.5.2.

Chapter 11 data set: 404

I'm trying to get the zebrafish data set. It seems the down-scaled sample is no longer in the Thunder distro.

ch06 - value flatmap is not a member of org.apache.spark.rdd.RDD[String]

When trying to follow along with the example in chapter 6, I get an error when trying to convert the xml to plain text.

scala> val plainText = rawXmls.flatMap(wikiXmlToPlainText)
:42: error: value flatmap is not a member of org.apache.spark.rdd.RDD[String]
val plaintext = rawXmls.flatmap(wikiXmlToPlainText)
^

Any ideas?
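
A likely cause, going by the line echoed in the error: Scala method names are case-sensitive, and the call that actually ran was flatmap rather than flatMap. A minimal corrected call:

// flatMap (capital M) is the RDD method; flatmap is not defined on RDD
val plainText = rawXmls.flatMap(wikiXmlToPlainText)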

Chapter 10: adamLoad()

In the latest release of 'adam', adamLoad() has been changed to more specific methods:

E.g., for reading, 'adamLoad' has been replaced with 'loadAlignments':

val readsRDD:RDD[AlignmentRecord] = sc.loadAlignments("/Users/davidlaxer/genomics/reads/HG00103")

See:

bigdatagenomics/adam#835 (comment)

transient variable conf

To run the following statement in the Spark shell, the variable must be made transient:
val conf = new Configuration()

Replace with:
@transient val conf = new Configuration()

Since the chapter code is meant to be run in the Spark shell, it should be mentioned that the variable needs to be @transient.

ch06 - value toMap is not a member of org.apache.spark.rdd.RDD[(String, Double)]

When using the book's example code:

val idfs = docFreqs.map{
| case (term, count) => (term, math.log(numDocs.toDouble / count))
| }.toMap

I get back:

:104: error: value toMap is not a member of org.apache.spark.rdd.RDD[(String, Double)]
possible cause: maybe a semicolon is missing before `value toMap'?
}.toMap
^

Currently importing:
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
import org.apache.spark.SparkContext._

//lemmatization
import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._

//need properties class
import java.util.Properties

//need array buffer class
import scala.collection.mutable.ArrayBuffer

//need rdd class
import org.apache.spark.rdd.RDD

//need foreach
import scala.collection.JavaConversions._

//computing the tf-idfs
import scala.collection.mutable.HashMap
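
A possible explanation, assuming docFreqs here is still an RDD[(String, Int)] rather than a local collection: toMap is only defined on Scala collections, so the pairs must be brought back to the driver first. A minimal sketch, assuming the document frequencies are small enough to collect:

// collect the (term, count) pairs to the driver, then build the local idfs map
val idfs = docFreqs.collect().map {
  case (term, count) => (term, math.log(numDocs.toDouble / count))
}.toMap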

value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]

The following code from ch 6 generates an error.

def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP)
    : Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 2 && !stopWords.contains(lemma) && isOnlyLetters(lemma)) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }

The error is

<console>:37: error: value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]
           for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
                            ^
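
A possible fix, assuming a Scala 2.10/2.11 setup like the book's: the for comprehension needs the Java-to-Scala collection conversions in scope before java.util.List gets a foreach. With the import below in scope, the method above compiles as written.

// implicit conversions that let Java collections be traversed in for comprehensions
import scala.collection.JavaConversions._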

value foldLeft is not a member of (String, Seq[String])

In ch 6, the following code snippet generates an error.

import scala.collection.mutable.HashMap
val docTermFreqs = lemmatized.map(terms => {
  val termFreqs = terms.foldLeft(new HashMap[String, Int]()) {
    (map, term) => {
      map += term -> (map.getOrElse(term, 0) + 1)
      map
    }
  }
  termFreqs
})

The error is

<console>:64: error: value foldLeft is not a member of (String, Seq[String])
       val termFreqs = terms.foldLeft(new HashMap[String, Int]()) {
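
A possible explanation, assuming lemmatized is an RDD of (title, terms) pairs as in the repository's RunLSA: terms inside the closure is then the whole tuple, not the Seq[String]. A minimal sketch that destructures the pair first:

import scala.collection.mutable.HashMap

// destructure each (title, terms) pair so foldLeft runs on the Seq[String] of terms
val docTermFreqs = lemmatized.map { case (title, terms) =>
  val termFreqs = terms.foldLeft(new HashMap[String, Int]()) { (map, term) =>
    map += term -> (map.getOrElse(term, 0) + 1)
    map
  }
  (title, termFreqs)
}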

Provide data sources for examples

Hi,
It would be great to have the data sources to run the examples. Could you provide either a link, the data itself, or a way to get it?

Cheers,
Yann

Discuss 'inlining' geojson

Let's talk a little about the namespace, API, etc and whether it should be inlined into the book code repo.

mvn package getting failed for chapter 6 lsa

Hi ,

I am trying to work through chapter 6. I am building the package as described in the book, but I am stuck with the error below.

[ERROR] Failed to execute goal on project ch06-lsa: Could not resolve dependencies for project com.cloudera.datascience:ch06-lsa:jar:1.0.0: Failure to find com.cloudera.datascience:common:jar:1.0.0 in http://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
[ERROR]

Thanks,
Vishnu

Ch10: adamLoad error

Getting the following in the adam-shell:

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> import org.bdgenomics.formats.avro.AlignmentRecord
import org.bdgenomics.formats.avro.AlignmentRecord

scala> val readsRDD: RDD[AlignmentRecord] = sc.adamLoad("genomics/reads/HG00103")
<console>:20: error: value adamLoad is not a member of org.apache.spark.SparkContext
       val readsRDD: RDD[AlignmentRecord] = sc.adamLoad("genomics/reads/HG00103")

ch05 - potentially unexpected result from sample code

On page 92, in calculating sumSquares, the code is:

val sumSquares = dataAsArray.fold(
  new Array[Double](numCols)
)(
  (a,b) => a.zip(b).map(t => t._1 + t._2 * t._2)
)

Since RDD.fold requires the operator to be commutative and associative, which is violated by the asymmetry in the map() function, the result may differ depending on the number of partitions in the RDD.
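
A hedged, partition-count-independent alternative: square each element first, then fold with a plain element-wise sum, whose combine step is symmetric.

// square every value per row first, then reduce with a commutative, associative element-wise sum
val sumSquares = dataAsArray
  .map(row => row.map(v => v * v))
  .fold(new Array[Double](numCols)) { (a, b) =>
    a.zip(b).map { case (x, y) => x + y }
  }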

I have 2 questions.

1: I tried the code in the chapter 5 K-means clustering example. In Eclipse for Scala, there are some compile errors about Vector.
def distance(a: Vector,b: Vector) = math.sqrt(a.toArray.zip(b.toArray).map(p => p._1 - p._2).map(d => d * d).sum)
There is an error about Vector: type Vector takes type parameters.
I don't know what's wrong.

2: Then I tried the code in chapter 8. I couldn't find a jar that contains "com.cloudera.datascience.geotime.GeoJsonProtocol._".

In pom.xml there is a dependency

<dependency>
  <groupId>com.cloudera.datascience</groupId>
  <artifactId>common</artifactId>
  <version>${project.version}</version>
</dependency>

Where can I fetch this "common" jar? I couldn't find it on "http://mvnrepository.com/".

Could someone help me?
Thanks.
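
On question 1, a likely cause (assuming the MLlib Vector was intended): without the import below, Scala resolves Vector to scala.collection.immutable.Vector, which requires a type parameter. A minimal sketch:

// import Spark MLlib's Vector so the parameter-less Vector type resolves correctly
import org.apache.spark.mllib.linalg.Vector

def distance(a: Vector, b: Vector): Double =
  math.sqrt(a.toArray.zip(b.toArray).map(p => p._1 - p._2).map(d => d * d).sum)

On question 2, the build output elsewhere in these issues suggests the common module is part of this repository rather than Maven Central, so it is produced locally when the whole project is built from the root.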

Can I run the 'AAS' code from a Zeppelin notebook?

Can I run the 'AAS' code from Zeppelin?
If yes, how do I import the chapter's .jar?

Spark Shell:
~/spark/bin/spark-shell --jars target/ch06-lsa-1.0.0.jar

Zeppelin:
./bin/zeppelin-daemon.sh start
Pid dir doesn't exist, create /home/ubuntu/incubator-zeppelin/run
Zeppelin start [ OK ]

Thanks in advance!

Ch06 - org.apache.spark.SparkException: Task not serializable

The book example uses the path to wikidump.xml, but the GitHub code is looking at a directory. Where and how was the xml file broken up? I'm getting this error in the preprocessing function when running flatMap.

org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:303)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:302)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:302)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$RunLSA$.preprocessing(:234)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$RunLSA$.main(:65)

Additionally, is there any documentation on how to run RunLSA? The book example uses spark-shell, but I've had to change a few things to get the GitHub code to play nicely with spark-shell.

Can't find XmlInputFormat class for ch6 and ch7

I can't find this dependency in mvnrepository or GitHub as described in ch6 and ch7. Can you give it to me?
Your book is great, and I learned a lot. Thanks!

<dependency>
  <groupId>com.cloudera.datascience</groupId>
  <artifactId>common</artifactId>
  <version>${project.version}</version>
</dependency>

Chapter 6: numDocs not defined

Forgive me if I'm opening and closing too many issues...

Chapter 6 has the code:

val idfs = docFreqs.map{
case (term, count) => (term, math.log(numDocs.toDouble / count))
}.toMap

I don't see numDocs defined in any of the code up to this point. Is this a typo? Should it be numTerms?
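
A hedged guess at the intent: numDocs looks like the total number of documents in the corpus (the usual IDF computation divides the total document count by each term's document frequency), so it is probably not numTerms. One way to compute it, assuming docTermFreqs is the per-document frequency RDD built earlier:

// total number of documents, used inside the IDF logarithm
val numDocs = docTermFreqs.count()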

ch06 - error: value _1 is not a member of scala......

<console>:93: error: value _1 is not a member of scala.collection.mutable.HashMap[String,Int]
       val docIds = docTermFreqs.map(_._1).zipWithUniqueId().map(_.swap).collectAsMap()

The code in the book doesn't define any variable called docIds, and I don't see any comments about it in the code. I'm having trouble debugging this, as I'm not sure exactly what this line is trying to accomplish. What does (_._1) mean?

It's a bit frustrating that the code in the book doesn't work, and I can't manage to get the code on GitHub to work either. Any help would be much appreciated.
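
On the last question: _._1 is shorthand for a function that returns the first element of a tuple, so the line assumes docTermFreqs holds (title, termFrequencies) pairs rather than bare HashMaps, which is what the error message suggests it currently holds. A more explicit, hedged rendering of what the line is trying to do:

val docIds = docTermFreqs
  .map(pair => pair._1)   // keep only the document title from each (title, freqs) pair
  .zipWithUniqueId()      // attach a unique Long id to each title
  .map(_.swap)            // flip (title, id) into (id, title)
  .collectAsMap()         // bring the small id -> title lookup back to the driver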

Issues in Chapter 8 nscala_time

Hi ,

I tried to add the nscala-time_2.10-1.8.0.jar to the Spark shell and imported the package. But unfortunately, when I use it, I end up with this error.

scala> import com.github.nscala_time.time.Imports._
import com.github.nscala_time.time.Imports._

scala> val dt = new DateTime(2015,2,2,20,0)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in BuilderImplicits.class refers to term time
in value org.joda which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling BuilderImplicits.class.
That entry seems to have slain the compiler. Shall I replay
your session? I can re-run each line except the last one.

Can you help me resolve this?

ch06 - not found: value bIdfs

In the chapter 6 text,

case (term,freq) => (bTermIds(term), bIdfs(term) * termFreqs(term) / docTotalTerms)

bIdfs isn't defined, and I can't find an equivalent in the RunLSA code.
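
A hedged guess at the missing definitions, assuming the 'b' prefix means a broadcast copy of the driver-side maps: the quoted line calls bIdfs(term) directly, so bIdfs would have to be the value of such a broadcast, for example:

// broadcast the small lookup maps; taking .value yields plain Maps, so bIdfs(term) type-checks
val bIdfs = sc.broadcast(idfs).value
val bTermIds = sc.broadcast(termIds).value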

Problems build Chapter 6 (and also the root)

Hi,
I'm running on Ubuntu 14.04 LTS on an EC2 instance.
ubuntu@ip-10-0-1-186:/aas$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty
ubuntu@ip-10-0-1-186:/aas$

When I tried to run mvn package in ch06-lsa:
ubuntu@ip-10-0-1-186:/aas/ch06-lsa$ mvn package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Wikipedia Latent Semantic Analysis 1.0.0
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for com.cloudera.datascience:common:jar:1.0.0 is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.057s
[INFO] Finished at: Tue May 26 21:14:19 UTC 2015
[INFO] Final Memory: 11M/225M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project ch06-lsa: Could not resolve dependencies for project com.cloudera.datascience:ch06-lsa:jar:1.0.0: Failure to find com.cloudera.datascience:common:jar:1.0.0 in http://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
ubuntu@ip-10-0-1-186:/aas/ch06-lsa$

When I tried to run mvn in the root:

ubuntu@ip-10-0-1-186:/aas$ mvn install
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Advanced Analytics with Spark
[INFO] Advanced Analytics with Spark Common
[INFO] Introduction to Data Analysis with Scala and Spark
[INFO] Recommender Engines with Audioscrobbler data
[INFO] Covtype with Random Decision Forests
[INFO] Anomaly Detection with K-means
[INFO] Wikipedia Latent Semantic Analysis
[INFO] Network Analysis with GraphX
[INFO] Temporal and Geospatial Analysis
[INFO] Value at Risk through Monte Carlo Simulation
[INFO] Genomics Analysis with ADAM
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Advanced Analytics with Spark 1.0.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-enforcer-plugin:1.4:enforce (enforce) @ spark-book-parent ---
[WARNING] Rule 1: org.apache.maven.plugins.enforcer.RequireMavenVersion failed with message:
Detected Maven Version: 3.0.5 is not in the allowed range 3.1.1.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Advanced Analytics with Spark ..................... FAILURE [1.570s]
[INFO] Advanced Analytics with Spark Common .............. SKIPPED
[INFO] Introduction to Data Analysis with Scala and Spark SKIPPED
[INFO] Recommender Engines with Audioscrobbler data ...... SKIPPED
[INFO] Covtype with Random Decision Forests .............. SKIPPED
[INFO] Anomaly Detection with K-means .................... SKIPPED
[INFO] Wikipedia Latent Semantic Analysis ................ SKIPPED
[INFO] Network Analysis with GraphX ...................... SKIPPED
[INFO] Temporal and Geospatial Analysis .................. SKIPPED
[INFO] Value at Risk through Monte Carlo Simulation ...... SKIPPED
[INFO] Genomics Analysis with ADAM ....................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.867s
[INFO] Finished at: Tue May 26 21:15:21 UTC 2015
[INFO] Final Memory: 14M/285M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce (enforce) on project spark-book-parent: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
ubuntu@ip-10-0-1-186:/aas$

Any ideas?

ch09 - breaking in trimToRegion when featurizing S&P-crudeoil-etc factors

It's breaking when calling trimmed.head._1 in trimToRegion from line 200 in featurize(). I'm still relatively new to Scala, so I'm not well equipped to investigate what exactly is breaking. I tried running it only on factors1 and also only on factors2 instead of the two concatenated together, and it still breaks both times. factors2 consists of the S&P and NASDAQ data downloaded with the provided shell script -- as opposed to crude oil and US T-bonds copied and pasted from investing.com -- so I'm thinking this might be a true bug rather than something I introduced. I'll go hone my Scala skills a bit more so I can investigate this further. If I find anything I'll let you know.

spark-submit --class com.cloudera.datascience.risk.RunRisk --master local target/ch09-risk-1.0.2-jar-with-dependencies.jar

Exception in thread "main" java.util.NoSuchElementException: next on empty iterator
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
        at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
        at scala.collection.IterableLike$class.head(IterableLike.scala:91)
        at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108)
        at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
        at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108)
        at com.cloudera.datascience.risk.RunRisk$.trimToRegion(RunRisk.scala:200)
        at com.cloudera.datascience.risk.RunRisk$$anonfun$16.apply(RunRisk.scala:112)
        at com.cloudera.datascience.risk.RunRisk$$anonfun$16.apply(RunRisk.scala:112)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at com.cloudera.datascience.risk.RunRisk$.readStocksAndFactors(RunRisk.scala:112)
        at com.cloudera.datascience.risk.RunRisk$.main(RunRisk.scala:34)
        at com.cloudera.datascience.risk.RunRisk.main(RunRisk.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

ch06 - error: type mismatch in TopTermsInTopConcepts function

scala> val topConceptTerms = topTermsInTopConcepts(svd, 4, 10, termIds)
<console>:108: error: type mismatch;
 found   : scala.collection.immutable.Map[String,Int]
 required: Map[Int,String]
       val topConceptTerms = topTermsInTopConcepts(svd, 4, 10, termIds)
                                                                                                           ^(under termIds)
Using this code:

import scala.collection.mutable.ArrayBuffer
def topTermsInTopConcepts(svd: SingularValueDecomposition[RowMatrix, Matrix], numConcepts: Int,
      numTerms: Int, termIds: Map[Int, String]): Seq[Seq[(String, Double)]] = {
    val v = svd.V
    val topTerms = new ArrayBuffer[Seq[(String, Double)]]()
    val arr = v.toArray
    for (i <- 0 until numConcepts) {
        val offs = i * v.numRows
        val termWeights = arr.slice(offs, offs + v.numRows).zipWithIndex
        val sorted = termWeights.sortBy(-_._1)
        topTerms += sorted.take(numTerms).map{
            case (score, id) => (termIds(id), score)
        }
    }
    topTerms
}

Any ideas? Also, the function in the book is missing the function definition, though there is still a return statement.
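
A possible fix, going by the error message alone: the termIds being passed maps term -> index, while topTermsInTopConcepts expects index -> term, so inverting the map before the call should satisfy the signature. A minimal sketch:

// invert term -> index into the index -> term mapping the function expects
val termIdsByIndex: Map[Int, String] = termIds.map { case (term, index) => (index, term) }
val topConceptTerms = topTermsInTopConcepts(svd, 4, 10, termIdsByIndex)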

Add, standardize illustrations

We have some illustrations in chapters, but some chapters still have none. This is a placeholder to remind us to go back and review more recent chapters for illustrations.

Complete Acknowledgements

I started adding an Acknowledgements section to the Preface; we'll need to make sure it's complete before finishing.

Chapter 6: java.lang.IllegalArgumentException: No annotator named tokenize

Following the example in chapter 6, I am getting the following error shortly after running: docTermFreqs.flatMap(_.keySet).distinct().count()

It starts splitting input and executing tasks, then:
15/07/10 15:42:41 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: No annotator named tokenize
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:83)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:292)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:125)
at $line171.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.createNLPPipeline(:70)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:92)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
edu.stanford.nlp.pipeline.AnnotatorImplementations:
15/07/10 15:42:41 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: No annotator named tokenize
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:83)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:292)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:125)
at $line171.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.createNLPPipeline(:70)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:92)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/07/10 15:42:41 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Adding annotator pos
15/07/10 15:42:41 INFO TaskSchedulerImpl: Cancelling stage 0
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 9.0 in stage 0.0 (TID 9)
15/07/10 15:42:41 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 2.0 in stage 0.0 (TID 2)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 3.0 in stage 0.0 (TID 3)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 10.0 in stage 0.0 (TID 10)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 4.0 in stage 0.0 (TID 4)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 11.0 in stage 0.0 (TID 11)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 12.0 in stage 0.0 (TID 12)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 13.0 in stage 0.0 (TID 13)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 5.0 in stage 0.0 (TID 5)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 6.0 in stage 0.0 (TID 6)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 7.0 in stage 0.0 (TID 7)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 14.0 in stage 0.0 (TID 14)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 8.0 in stage 0.0 (TID 8)
15/07/10 15:42:41 INFO DAGScheduler: Job 0 failed: count at :96, took 1.887817 s
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: No annotator named tokenize
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:83)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:292)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:125)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.createNLPPipeline(:70)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:92)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Ch 6 - not enough arguments for method plainTextToLemmas

The following statement from ch 6 generates an error:

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
<console>:59: error: not enough arguments for method plainTextToLemmas: (text: String, stopWords: Set[String], pipeline: edu.stanford.nlp.pipeline.StanfordCoreNLP)Seq[String].
Unspecified value parameter pipeline.
       val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))

The repo code is fine.

val lemmatized = plainText.mapPartitions(iter => {
      val pipeline = createNLPPipeline()
      iter.map{ case(title, contents) => (title, plainTextToLemmas(contents, stopWords, pipeline))}
    })

Ch 10 data

Can't get the 16 GB dataset:

$ curl  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
curl: (6) Could not resolve host: ftp-trace.ncbi.nih.gov

maven build errors for XmlInputFormat.java involving StandardCharsets

Maven build errors for XmlInputFormat.java. Stacktrace indicates maven plugin issue (MojoExecutor)

Errors like this:
".../common/XmlInputFormat.java:[21,24] cannot find symbol
symbol : class StandardCharsets
location: package java.nio.charset "

I'm on Java 7:
java version "1.7.0_71"

Rolling back to the previous version of XmlInputFormat.java (the one that used Guava) works.

Thanks for any guidance.

How do you run these .jar files?

It's not clear from the documentation how to run the .jar files. For example, ch06-lsa requires the file stopwords.txt, but that file is in target/classes/stopwords.txt and in src/main/resources/stopwords.txt. When the jar file is run with this command: spark-submit --class com.cloudera.datascience.lsa.RunLSA target/ch06-lsa-1.0.2-jar-with-dependencies.jar, an error is generated indicating that the file stopwords.txt can't be found.

Chapter 10 build broken by an ADAM change

I just pulled the book source (master 94fa09d) and got the following error when running mvn:

[INFO] --- scala-maven-plugin:3.2.0:compile (default) @ ch10-genomics ---
[INFO] artifact joda-time:joda-time: checking for updates from central
[INFO] /Users/tom/src/scala/aas/ch10-genomics/src/main/scala:-1: info: compiling
[INFO] Compiling 1 source files to /Users/tom/src/scala/aas/ch10-genomics/target/classes at 1424131711406
[ERROR] /Users/tom/src/scala/aas/ch10-genomics/src/main/scala/com/cloudera/datascience/genomics/RunTFPrediction.scala:16: error: object FeaturesContext is not a member of package org.bdgenomics.adam.rdd.features
[ERROR] import org.bdgenomics.adam.rdd.features.FeaturesContext._
[ERROR]                                         ^

followed by 8 others in the same file.

This would appear to be caused by the last commit to ADAM (bigdatagenomics/adam@3f0eadb) which removed the FeaturesContext and GeneContext classes and apparently replaced their functions with the more generic loadFeatures function in the ADAMContext.
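
A hedged sketch of the corresponding source change, assuming the replacement described above:

// FeaturesContext was removed upstream and its loaders folded into ADAMContext,
// so the removed import goes away and the remaining context import suffices:
import org.bdgenomics.adam.rdd.ADAMContext._
// calls previously made through FeaturesContext would then go through the generic
// loader mentioned in the ADAM change, e.g. sc.loadFeatures(<path to features>)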

There is something wrong when I launch ch07-graph

There is nothing wrong when I run mvn assembly:assembly in the ch07-graph folder.

INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Exception in thread "main" java.lang.NoClassDefFoundError: com/cloudera/datascience/common/XmlInputFormat
at com.cloudera.datascience.graph.RunGraph$.loadMedline(RunGraph.scala:188)
at com.cloudera.datascience.graph.RunGraph$.main(RunGraph.scala:29)
at com.cloudera.datascience.graph.RunGraph.main(RunGraph.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.cloudera.datascience.common.XmlInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 12 more
16/01/15 16:36:50 INFO spark.SparkContext: Invoking stop() from shutdown hook

ch08 error: not found: type DateTime

Hi, I have built the mvn package from the root to create all the jar files for the chapters. I copied the jar file for ch08 into the external folder of the spark-1.5.0 install and started it with ./bin/spark-shell --jars external/ch08-geotime-1.0.1-jar-with-dependencies.jar --master local[*].
All the imports work OK, including import com.github.nscala_time.time.Imports._.
I can create a new DateTime, e.g. val test = new DateTime,
but when I run
case class Trip(
pickupTime: DateTime,
dropoffTime: DateTime,
pickupLoc: Point,
dropoffLoc: Point)
I get the error: error: not found: type DateTime

Could you please give me a tip as to why it works when creating a new object but is not recognised in the case class?
Many thanks,
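
A hedged workaround to try, assuming ch08's Point is the Esri geometry class and that the shell is simply not carrying nscala-time's DateTime re-export into the case class definition: import the underlying Joda-Time class explicitly before defining Trip.

// nscala-time re-exports Joda-Time's DateTime; importing it directly sometimes
// resolves 'not found: type DateTime' in the shell
import org.joda.time.DateTime
import com.esri.core.geometry.Point

case class Trip(
  pickupTime: DateTime,
  dropoffTime: DateTime,
  pickupLoc: Point,
  dropoffLoc: Point)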

compilation CH06

Hi,

I'm trying chapter 6 and I have 2 questions:
First,
cd aas
mvn install
cd ch06-lsa
mvn package
cd ..
./spark/bin/spark-submit --class com.cloudera.datascience.lsa.RunLSA aas/ch06-lsa/target/ch06-lsa-1.0.0.jar

but I get an error:

15/05/28 18:07:33 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.NoClassDefFoundError: edu/umd/cloud9/collection/wikipedia/WikipediaPage
at com.cloudera.datascience.lsa.RunLSA$.preprocessing(RunLSA.scala:54)
at com.cloudera.datascience.lsa.RunLSA$.main(RunLSA.scala:33)
at com.cloudera.datascience.lsa.RunLSA.main(RunLSA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: edu.umd.cloud9.collection.wikipedia.WikipediaPage
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

I'm launching from the master node of an EC2 Spark install (https://spark.apache.org/docs/latest/ec2-scripts.html).

Secondly, how do I launch the main function from RunLSA in the Spark shell?

./spark/bin/spark-shell --jars aas/ch06-lsa/target/ch06-lsa-1.0.0.jar

I have been trying

import com.cloudera.datascience.lsa.RunLSA
RunLSA.main(Array("100","1000","0.1"))

but I get the error

15/05/28 18:14:21 WARN spark.SparkContext: Multiple running SparkContexts detected in the same JVM!
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.(SparkContext.scala:80)

Just looking for your best practice.

Thanks a lot.

chapter 6: termDocMatrix not defined

In chapter 6,

termDocMatrix.cache()

This variable isn't defined earlier in the chapter. I haven't been able to find a suitable way to do this without changing everything to be more like RunLSA, which creates a different set of issues.

Any assistance appreciated.
