
aas's Introduction

Advanced Analytics with Spark Source Code

Code to accompany Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills.

Advanced Analytics with Spark

3rd Edition (current)

The source to accompany the 3rd edition may be found here, in the default master branch.

2nd Edition

The source to accompany the 2nd edition may be found in the 2nd-edition branch.

1st Edition

The source to accompany the 1st edition may be found in the 1st-edition branch.

Build

Apache Maven 3.2.5+ and Java 8+ are required to build. From the root level of the project, run mvn package to compile artifacts into target/ subdirectories beneath each chapter's directory.

Running the Examples

  • Install Apache Spark for your platform, following the instructions for the latest release.
  • Build the projects according to the instructions above.
  • Launch the driver program using spark-submit:
# working directory should be your Apache Spark installation root
bin/spark-submit /path/to/code/aas/$CHAPTER/target/$CHAPTER-$VERSION-jar-with-dependencies.jar
  • Some examples might require that URI paths to the data be updated to your own HDFS or local filesystem locations.

Data Sets


aas's People

Contributors

chansonzhang, dled, jwills, laserson, srowen, sryza, sskapci, tsuyo, yu-iskw


aas's Issues

Issues in running Chapter 6 code

I have packaged chapter 6 and included the jar using spark-shell.

When I try to execute the code below without @transient,

@transient val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "")
conf.set(XmlInputFormat.END_TAG_KEY, "")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable],
classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)

I get Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration .

With @transient in place I can proceed further, but after the transformation below

val plainText = rawXmls.flatMap(wikiXmlToPlainText)

I ran plainText.count

and it gave me the error below.

java.lang.NoClassDefFoundError: com/google/common/base/Charsets
at com.cloudera.datascience.common.XmlInputFormat$XmlRecordReader.(XmlInputFormat.java:79)
at com.cloudera.datascience.common.XmlInputFormat.createRecordReader(XmlInputFormat.java:55)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)

Am I missing something here?
I am using Spark 1.2 and Hadoop 2.5.2.

Chapter 11 data set: 404

I'm trying to get the zebrafish data set. It seems the down-scaled sample is no longer in the Thunder distro.

ch06 - value flatmap is not a member of org.apache.spark.rdd.RDD[String]

When trying to follow along with the example in chapter 6, I get an error when trying to convert the xml to plain text.

scala> val plainText = rawXmls.flatMap(wikiXmlToPlainText)
:42: error: value flatmap is not a member of org.apache.spark.rdd.RDD[String]
val plaintext = rawXmls.flatmap(wikiXmlToPlainText)
^

Any ideas?
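
A likely cause, going by the line echoed in the error: Scala method names are case-sensitive, and the call that actually ran was flatmap rather than flatMap. A minimal corrected call:

// flatMap (capital M) is the RDD method; flatmap is not defined on RDD
val plainText = rawXmls.flatMap(wikiXmlToPlainText)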

Chapter 10: adamLoad()

In the latest release of 'adam', adamLoad() has been changed to more specific methods:

E.g., for reading, 'adamLoad' has been replaced with 'loadAlignments':

val readsRDD:RDD[AlignmentRecord] = sc.loadAlignments("/Users/davidlaxer/genomics/reads/HG00103")

See:

bigdatagenomics/adam#835 (comment)

transient variable conf

To run the following statement in the Spark shell, the variable must be made transient:
val conf = new Configuration()

Replace with:
@transient val conf = new Configuration()

Since the chapter code is meant to be run in the Spark shell, it should be mentioned that the variable needs to be @transient.

ch06 - value toMap is not a member of org.apache.spark.rdd.RDD[(String, Double)]

When using the book's example code:

val idfs = docFreqs.map{
| case (term, count) => (term, math.log(numDocs.toDouble / count))
| }.toMap

I get back:

:104: error: value toMap is not a member of org.apache.spark.rdd.RDD[(String, Double)]
possible cause: maybe a semicolon is missing before `value toMap'?
}.toMap
^

Currently importing:
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
import org.apache.spark.SparkContext._

//lemmatization
import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._

//need properties class
import java.util.Properties

//need array buffer class
import scala.collection.mutable.ArrayBuffer

//need rdd class
import org.apache.spark.rdd.RDD

//need foreach
import scala.collection.JavaConversions._

//computing the tf-idfs
import scala.collection.mutable.HashMap
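
A possible explanation, assuming docFreqs here is still an RDD[(String, Int)] rather than a local collection: toMap is only defined on Scala collections, so the pairs must be brought back to the driver first. A minimal sketch, assuming the document frequencies are small enough to collect:

// collect the (term, count) pairs to the driver, then build the local idfs map
val idfs = docFreqs.collect().map {
  case (term, count) => (term, math.log(numDocs.toDouble / count))
}.toMap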

value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]

The following code from ch 6 generates an error.

def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP)
    : Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 2 && !stopWords.contains(lemma) && isOnlyLetters(lemma)) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }

The error is

<console>:37: error: value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]
           for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
                            ^
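
A possible fix, assuming a Scala 2.10/2.11 setup like the book's: the for comprehension needs the Java-to-Scala collection conversions in scope before java.util.List gets a foreach. With the import below in scope, the method above compiles as written.

// implicit conversions that let Java collections be traversed in for comprehensions
import scala.collection.JavaConversions._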

value foldLeft is not a member of (String, Seq[String])

In ch 6, the following code snippet generates an error.

import scala.collection.mutable.HashMap
val docTermFreqs = lemmatized.map(terms => {
  val termFreqs = terms.foldLeft(new HashMap[String, Int]()) {
    (map, term) => {
      map += term -> (map.getOrElse(term, 0) + 1)
      map
    }
  }
  termFreqs
})

The error is

<console>:64: error: value foldLeft is not a member of (String, Seq[String])
       val termFreqs = terms.foldLeft(new HashMap[String, Int]()) {
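
A possible explanation, assuming lemmatized is an RDD of (title, terms) pairs as in the repository's RunLSA: terms inside the closure is then the whole tuple, not the Seq[String]. A minimal sketch that destructures the pair first:

import scala.collection.mutable.HashMap

// destructure each (title, terms) pair so foldLeft runs on the Seq[String] of terms
val docTermFreqs = lemmatized.map { case (title, terms) =>
  val termFreqs = terms.foldLeft(new HashMap[String, Int]()) { (map, term) =>
    map += term -> (map.getOrElse(term, 0) + 1)
    map
  }
  (title, termFreqs)
}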

Provide data sources for examples

Hi,
It would be great to have the data sources to run the examples. Could you provide either a link, the data itself, or a way to get it?

Cheers,
Yann

Discuss 'inlining' geojson

Let's talk a little about the namespace, API, etc and whether it should be inlined into the book code repo.

mvn package getting failed for chapter 6 lsa

Hi ,

I am trying to work through chapter 6. I am building the package as described in the book, but I am stuck with the error below.

[ERROR] Failed to execute goal on project ch06-lsa: Could not resolve dependencies for project com.cloudera.datascience:ch06-lsa:jar:1.0.0: Failure to find com.cloudera.datascience:common:jar:1.0.0 in http://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
[ERROR]

Thanks,
Vishnu

Ch10: adamLoad error

Getting the following in the adam-shell:

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> import org.bdgenomics.formats.avro.AlignmentRecord
import org.bdgenomics.formats.avro.AlignmentRecord

scala> val readsRDD: RDD[AlignmentRecord] = sc.adamLoad("genomics/reads/HG00103")
<console>:20: error: value adamLoad is not a member of org.apache.spark.SparkContext
       val readsRDD: RDD[AlignmentRecord] = sc.adamLoad("genomics/reads/HG00103")

ch05 - potentially unexpected result from sample code

On page 92, in calculating sumSquares, the code is:

val sumSquares = dataAsArray.fold(
  new Array[Double](numCols)
)(
  (a,b) => a.zip(b).map(t => t._1 + t._2 * t._2)
)

Since RDD.fold requires the operator to be commutative and associative, which is violated by the asymmetry in the map() function, the result may differ depending on the number of partitions in the RDD.
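
A hedged, partition-count-independent alternative: square each element first, then fold with a plain element-wise sum, whose combine step is symmetric.

// square every value per row first, then reduce with a commutative, associative element-wise sum
val sumSquares = dataAsArray
  .map(row => row.map(v => v * v))
  .fold(new Array[Double](numCols)) { (a, b) =>
    a.zip(b).map { case (x, y) => x + y }
  }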

I have 2 questions.

1: I tried the code in the chapter 5 K-means clustering example. In Eclipse for Scala, there are some compile errors about Vector.
def distance(a: Vector,b: Vector) = math.sqrt(a.toArray.zip(b.toArray).map(p => p._1 - p._2).map(d => d * d).sum)
There is an error about Vector: type Vector takes type parameters.
I don't know what's wrong.

2: Then I tried the code in chapter 8. I couldn't find a jar that contains "com.cloudera.datascience.geotime.GeoJsonProtocol._".

In pom.xml there is a dependency

<dependency>
  <groupId>com.cloudera.datascience</groupId>
  <artifactId>common</artifactId>
  <version>${project.version}</version>
</dependency>

Where can I fetch this "common" jar? I couldn't find it on "http://mvnrepository.com/".

Could someone help me?
Thanks.
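
On question 1, a likely cause (assuming the MLlib Vector was intended): without the import below, Scala resolves Vector to scala.collection.immutable.Vector, which requires a type parameter. A minimal sketch:

// import Spark MLlib's Vector so the parameter-less Vector type resolves correctly
import org.apache.spark.mllib.linalg.Vector

def distance(a: Vector, b: Vector): Double =
  math.sqrt(a.toArray.zip(b.toArray).map(p => p._1 - p._2).map(d => d * d).sum)

On question 2, the build output elsewhere in these issues suggests the common module is part of this repository rather than Maven Central, so it is produced locally when the whole project is built from the root.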

Can I run the 'AAS' code from a Zeppelin notebook?

Can I run the 'AAS' code from Zeppelin?
If yes, how do I import the chapter's .jar?

Spark Shell:
~/spark/bin/spark-shell --jars target/ch06-lsa-1.0.0.jar

Zeppelin:
./bin/zeppelin-daemon.sh start
Pid dir doesn't exist, create /home/ubuntu/incubator-zeppelin/run
Zeppelin start [ OK ]

Thanks in advance!

Ch06 - org.apache.spark.SparkException: Task not serializable

The book example uses the path to wikidump.xml, but the GitHub code is looking at a directory. Where and how was the xml file broken up? I'm getting this error in the preprocessing function when running flatMap.

org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:303)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:302)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:302)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$RunLSA$.preprocessing(:234)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$RunLSA$.main(:65)

Additionally, is there any documentation on how to run RunLSA? The book example uses spark-shell, but I've had to change a few things to get the GitHub code to play nicely with spark-shell.

Can't find XmlInputFormat class for ch6 and ch7

I can't find this dependency in mvnrepository or GitHub as described in ch6 and ch7. Can you give it to me?
Your book is great, and I learned a lot. Thanks!

<dependency>
  <groupId>com.cloudera.datascience</groupId>
  <artifactId>common</artifactId>
  <version>${project.version}</version>
</dependency>

Chapter 6: numDocs not defined

Forgive me if I'm opening and closing too many issues...

Chapter 6 has the code:

val idfs = docFreqs.map{
case (term, count) => (term, math.log(numDocs.toDouble / count))
}.toMap

I don't see numDocs defined in any of the code up to this point. Is this a typo? Should it be numTerms?
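
A hedged guess at the intent: numDocs looks like the total number of documents in the corpus (the usual IDF computation divides the total document count by each term's document frequency), so it is probably not numTerms. One way to compute it, assuming docTermFreqs is the per-document frequency RDD built earlier:

// total number of documents, used inside the IDF logarithm
val numDocs = docTermFreqs.count()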

ch06 - error: value _1 is not a member of scala......

<console>:93: error: value _1 is not a member of scala.collection.mutable.HashMap[String,Int]
       val docIds = docTermFreqs.map(_._1).zipWithUniqueId().map(_.swap).collectAsMap()

The code in the book doesn't define any variable called docIds, and I don't see any comments about it in the code. I'm having trouble debugging this, as I'm not sure exactly what this line is trying to accomplish. What does (_._1) mean?

It's a bit frustrating that the code in the book doesn't work, and I can't manage to get the code on GitHub to work either. Any help would be much appreciated.
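
On the last question: _._1 is shorthand for a function that returns the first element of a tuple, so the line assumes docTermFreqs holds (title, termFrequencies) pairs rather than bare HashMaps, which is what the error message suggests it currently holds. A more explicit, hedged rendering of what the line is trying to do:

val docIds = docTermFreqs
  .map(pair => pair._1)   // keep only the document title from each (title, freqs) pair
  .zipWithUniqueId()      // attach a unique Long id to each title
  .map(_.swap)            // flip (title, id) into (id, title)
  .collectAsMap()         // bring the small id -> title lookup back to the driver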

Issues in Chapter 8 nscala_time

Hi ,

I tried to add the nscala-time_2.10-1.8.0.jar to the Spark shell and imported the package. But unfortunately, when I use it, I end up with this error.

scala> import com.github.nscala_time.time.Imports._
import com.github.nscala_time.time.Imports._

scala> val dt = new DateTime(2015,2,2,20,0)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in BuilderImplicits.class refers to term time
in value org.joda which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling BuilderImplicits.class.
That entry seems to have slain the compiler. Shall I replay
your session? I can re-run each line except the last one.

Can you help me resolve this?

ch06 - not found: value bIdfs

In the chapter 6 text,

case (term,freq) => (bTermIds(term), bIdfs(term) * termFreqs(term) / docTotalTerms)

bIdfs isn't defined, and I can't find an equivalent in the RunLSA code.
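
A hedged guess at the missing definitions, assuming the 'b' prefix means a broadcast copy of the driver-side maps: the quoted line calls bIdfs(term) directly, so bIdfs would have to be the value of such a broadcast, for example:

// broadcast the small lookup maps; taking .value yields plain Maps, so bIdfs(term) type-checks
val bIdfs = sc.broadcast(idfs).value
val bTermIds = sc.broadcast(termIds).value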

Problems build Chapter 6 (and also the root)

Hi,
I'm running on Ubuntu 14.04 LTS on an EC2 instance.
ubuntu@ip-10-0-1-186:/aas$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty
ubuntu@ip-10-0-1-186:/aas$

When I tried to run mvn package in ch06-lsa:
ubuntu@ip-10-0-1-186:/aas/ch06-lsa$ mvn package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Wikipedia Latent Semantic Analysis 1.0.0
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for com.cloudera.datascience:common:jar:1.0.0 is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.057s
[INFO] Finished at: Tue May 26 21:14:19 UTC 2015
[INFO] Final Memory: 11M/225M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project ch06-lsa: Could not resolve dependencies for project com.cloudera.datascience:ch06-lsa:jar:1.0.0: Failure to find com.cloudera.datascience:common:jar:1.0.0 in http://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
ubuntu@ip-10-0-1-186:/aas/ch06-lsa$

When I tried to run mvn in the root:

ubuntu@ip-10-0-1-186:/aas$ mvn install
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Advanced Analytics with Spark
[INFO] Advanced Analytics with Spark Common
[INFO] Introduction to Data Analysis with Scala and Spark
[INFO] Recommender Engines with Audioscrobbler data
[INFO] Covtype with Random Decision Forests
[INFO] Anomaly Detection with K-means
[INFO] Wikipedia Latent Semantic Analysis
[INFO] Network Analysis with GraphX
[INFO] Temporal and Geospatial Analysis
[INFO] Value at Risk through Monte Carlo Simulation
[INFO] Genomics Analysis with ADAM
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Advanced Analytics with Spark 1.0.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-enforcer-plugin:1.4:enforce (enforce) @ spark-book-parent ---
[WARNING] Rule 1: org.apache.maven.plugins.enforcer.RequireMavenVersion failed with message:
Detected Maven Version: 3.0.5 is not in the allowed range 3.1.1.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Advanced Analytics with Spark ..................... FAILURE [1.570s]
[INFO] Advanced Analytics with Spark Common .............. SKIPPED
[INFO] Introduction to Data Analysis with Scala and Spark SKIPPED
[INFO] Recommender Engines with Audioscrobbler data ...... SKIPPED
[INFO] Covtype with Random Decision Forests .............. SKIPPED
[INFO] Anomaly Detection with K-means .................... SKIPPED
[INFO] Wikipedia Latent Semantic Analysis ................ SKIPPED
[INFO] Network Analysis with GraphX ...................... SKIPPED
[INFO] Temporal and Geospatial Analysis .................. SKIPPED
[INFO] Value at Risk through Monte Carlo Simulation ...... SKIPPED
[INFO] Genomics Analysis with ADAM ....................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.867s
[INFO] Finished at: Tue May 26 21:15:21 UTC 2015
[INFO] Final Memory: 14M/285M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce (enforce) on project spark-book-parent: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
ubuntu@ip-10-0-1-186:/aas$

Any ideas?

ch09 - breaking in trimToRegion when featurizing S&P-crudeoil-etc factors

It's breaking when calling trimmed.head._1 in trimToRegion from line 200 in featurize(). I'm still relatively new to Scala, so I'm not well equipped to investigate what exactly is breaking. I tried running it only on factors1 and also only on factors2 instead of the two concatenated together, and it still breaks both times. factors2 consists of the S&P and NASDAQ data downloaded with the provided shell script -- as opposed to crude oil and US T-bonds copied and pasted from investing.com -- so I'm thinking this might be a true bug rather than something I introduced. I'll go hone my Scala skills a bit more so I can investigate this further. If I find anything I'll let you know.

spark-submit --class com.cloudera.datascience.risk.RunRisk --master local target/ch09-risk-1.0.2-jar-with-dependencies.jar

Exception in thread "main" java.util.NoSuchElementException: next on empty iterator
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
        at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
        at scala.collection.IterableLike$class.head(IterableLike.scala:91)
        at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108)
        at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
        at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108)
        at com.cloudera.datascience.risk.RunRisk$.trimToRegion(RunRisk.scala:200)
        at com.cloudera.datascience.risk.RunRisk$$anonfun$16.apply(RunRisk.scala:112)
        at com.cloudera.datascience.risk.RunRisk$$anonfun$16.apply(RunRisk.scala:112)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at com.cloudera.datascience.risk.RunRisk$.readStocksAndFactors(RunRisk.scala:112)
        at com.cloudera.datascience.risk.RunRisk$.main(RunRisk.scala:34)
        at com.cloudera.datascience.risk.RunRisk.main(RunRisk.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

ch06 - error: type mismatch in TopTermsInTopConcepts function

scala> val topConceptTerms = topTermsInTopConcepts(svd, 4, 10, termIds)
<console>:108: error: type mismatch;
 found   : scala.collection.immutable.Map[String,Int]
 required: Map[Int,String]
       val topConceptTerms = topTermsInTopConcepts(svd, 4, 10, termIds)
                                                                                                           ^(under termIds)
Using this code:

import scala.collection.mutable.ArrayBuffer
def topTermsInTopConcepts(svd: SingularValueDecomposition[RowMatrix, Matrix], numConcepts: Int,
      numTerms: Int, termIds: Map[Int, String]): Seq[Seq[(String, Double)]] = {
    val v = svd.V
    val topTerms = new ArrayBuffer[Seq[(String, Double)]]()
    val arr = v.toArray
    for (i <- 0 until numConcepts) {
        val offs = i * v.numRows
        val termWeights = arr.slice(offs, offs + v.numRows).zipWithIndex
        val sorted = termWeights.sortBy(-_._1)
        topTerms += sorted.take(numTerms).map{
            case (score, id) => (termIds(id), score)
        }
    }
    topTerms
}

Any ideas? Also, the function in the book is missing the function definition, though there is still a return statement.
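
A possible fix, going by the error message alone: the termIds being passed maps term -> index, while topTermsInTopConcepts expects index -> term, so inverting the map before the call should satisfy the signature. A minimal sketch:

// invert term -> index into the index -> term mapping the function expects
val termIdsByIndex: Map[Int, String] = termIds.map { case (term, index) => (index, term) }
val topConceptTerms = topTermsInTopConcepts(svd, 4, 10, termIdsByIndex)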

Add, standardize illustrations

We have some illustrations in chapters, but some chapters still have none. This is a placeholder to remind us to go back and review more recent chapters for illustrations.

Complete Acknowledgements

I started adding an Acknowledgements section to the Preface; we'll need to make sure it's complete before finishing.

Chapter 6: java.lang.IllegalArgumentException: No annotator named tokenize

Following the example in chapter 6, I am getting the following error shortly after running: docTermFreqs.flatMap(_.keySet).distinct().count()

It starts splitting input and executing tasks, then:
15/07/10 15:42:41 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: No annotator named tokenize
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:83)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:292)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:125)
at $line171.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.createNLPPipeline(:70)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:92)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
edu.stanford.nlp.pipeline.AnnotatorImplementations:
15/07/10 15:42:41 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: No annotator named tokenize
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:83)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:292)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:125)
at $line171.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.createNLPPipeline(:70)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:92)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/07/10 15:42:41 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Adding annotator pos
15/07/10 15:42:41 INFO TaskSchedulerImpl: Cancelling stage 0
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 9.0 in stage 0.0 (TID 9)
15/07/10 15:42:41 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 2.0 in stage 0.0 (TID 2)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 3.0 in stage 0.0 (TID 3)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 10.0 in stage 0.0 (TID 10)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 4.0 in stage 0.0 (TID 4)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 11.0 in stage 0.0 (TID 11)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 12.0 in stage 0.0 (TID 12)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 13.0 in stage 0.0 (TID 13)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 5.0 in stage 0.0 (TID 5)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 6.0 in stage 0.0 (TID 6)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 7.0 in stage 0.0 (TID 7)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 14.0 in stage 0.0 (TID 14)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 8.0 in stage 0.0 (TID 8)
15/07/10 15:42:41 INFO DAGScheduler: Job 0 failed: count at :96, took 1.887817 s
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
Adding annotator tokenize
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: No annotator named tokenize
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:83)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:292)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.(StanfordCoreNLP.java:125)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.createNLPPipeline(:70)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:92)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Ch 6 - not enough arguments for method plainTextToLemmas

The following statement from ch 6 generates an error:

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
<console>:59: error: not enough arguments for method plainTextToLemmas: (text: String, stopWords: Set[String], pipeline: edu.stanford.nlp.pipeline.StanfordCoreNLP)Seq[String].
Unspecified value parameter pipeline.
       val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))

The repo code is fine.

val lemmatized = plainText.mapPartitions(iter => {
      val pipeline = createNLPPipeline()
      iter.map{ case(title, contents) => (title, plainTextToLemmas(contents, stopWords, pipeline))}
    })

Ch 10 data

Can't get the 16 GB dataset:

$ curl  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
curl: (6) Could not resolve host: ftp-trace.ncbi.nih.gov

maven build errors for XmlInputFormat.java involving StandardCharsets

Maven build errors for XmlInputFormat.java. Stacktrace indicates maven plugin issue (MojoExecutor)

Errors like this:
".../common/XmlInputFormat.java:[21,24] cannot find symbol
symbol : class StandardCharsets
location: package java.nio.charset "

I'm on Java 7:
java version "1.7.0_71"

Rolling back to the previous version of XmlInputFormat.java (the one that used Guava) works.

Thanks for any guidance.

How do you run these .jar files?

It's not clear from the documentation how to run the .jar files. For example, ch06-lsa requires the file stopwords.txt, but that file is in target/classes/stopwords.txt and in src/main/resources/stopwords.txt. When the jar file is run with this command: spark-submit --class com.cloudera.datascience.lsa.RunLSA target/ch06-lsa-1.0.2-jar-with-dependencies.jar, an error is generated indicating that the file stopwords.txt can't be found.

Chapter 10 build broken by an ADAM change

I just pulled the book source (master 94fa09d) and got the following error when running mvn:

[INFO] --- scala-maven-plugin:3.2.0:compile (default) @ ch10-genomics ---
[INFO] artifact joda-time:joda-time: checking for updates from central
[INFO] /Users/tom/src/scala/aas/ch10-genomics/src/main/scala:-1: info: compiling
[INFO] Compiling 1 source files to /Users/tom/src/scala/aas/ch10-genomics/target/classes at 1424131711406
[ERROR] /Users/tom/src/scala/aas/ch10-genomics/src/main/scala/com/cloudera/datascience/genomics/RunTFPrediction.scala:16: error: object FeaturesContext is not a member of package org.bdgenomics.adam.rdd.features
[ERROR] import org.bdgenomics.adam.rdd.features.FeaturesContext._
[ERROR]                                         ^

followed by 8 others in the same file.

This would appear to be caused by the last commit to ADAM (bigdatagenomics/adam@3f0eadb) which removed the FeaturesContext and GeneContext classes and apparently replaced their functions with the more generic loadFeatures function in the ADAMContext.
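
A hedged sketch of the corresponding source change, assuming the replacement described above:

// FeaturesContext was removed upstream and its loaders folded into ADAMContext,
// so the removed import goes away and the remaining context import suffices:
import org.bdgenomics.adam.rdd.ADAMContext._
// calls previously made through FeaturesContext would then go through the generic
// loader mentioned in the ADAM change, e.g. sc.loadFeatures(<path to features>)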

There is something wrong when I launch ch07-graph

There is nothing wrong when I run mvn assembly:assembly in the ch07-graph folder.

INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Exception in thread "main" java.lang.NoClassDefFoundError: com/cloudera/datascience/common/XmlInputFormat
at com.cloudera.datascience.graph.RunGraph$.loadMedline(RunGraph.scala:188)
at com.cloudera.datascience.graph.RunGraph$.main(RunGraph.scala:29)
at com.cloudera.datascience.graph.RunGraph.main(RunGraph.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.cloudera.datascience.common.XmlInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 12 more
16/01/15 16:36:50 INFO spark.SparkContext: Invoking stop() from shutdown hook

ch08 error: not found: type DateTime

Hi, I have built the mvn package from the root to create all the jar files for the chapters. I copied the jar file for ch08 into the external folder of the spark-1.5.0 install and started it with ./bin/spark-shell --jars external/ch08-geotime-1.0.1-jar-with-dependencies.jar --master local[*].
All the imports work OK, including import com.github.nscala_time.time.Imports._.
I can create a new DateTime, e.g. val test = new DateTime,
but when I run
case class Trip(
pickupTime: DateTime,
dropoffTime: DateTime,
pickupLoc: Point,
dropoffLoc: Point)
I get the error: error: not found: type DateTime

Could you please give me a tip as to why it works when creating a new object but is not recognised in the case class?
Many thanks,
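
A hedged workaround to try, assuming ch08's Point is the Esri geometry class and that the shell is simply not carrying nscala-time's DateTime re-export into the case class definition: import the underlying Joda-Time class explicitly before defining Trip.

// nscala-time re-exports Joda-Time's DateTime; importing it directly sometimes
// resolves 'not found: type DateTime' in the shell
import org.joda.time.DateTime
import com.esri.core.geometry.Point

case class Trip(
  pickupTime: DateTime,
  dropoffTime: DateTime,
  pickupLoc: Point,
  dropoffLoc: Point)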

compilation CH06

Hi,

I'm trying chapter 6 and I have 2 questions:
First,
cd aas
mvn install
cd ch06-lsa
mvn package
cd ..
./spark/bin/spark-submit --class com.cloudera.datascience.lsa.RunLSA aas/ch06-lsa/target/ch06-lsa-1.0.0.jar

but I get an error:

15/05/28 18:07:33 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.NoClassDefFoundError: edu/umd/cloud9/collection/wikipedia/WikipediaPage
at com.cloudera.datascience.lsa.RunLSA$.preprocessing(RunLSA.scala:54)
at com.cloudera.datascience.lsa.RunLSA$.main(RunLSA.scala:33)
at com.cloudera.datascience.lsa.RunLSA.main(RunLSA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: edu.umd.cloud9.collection.wikipedia.WikipediaPage
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

I'm launching from the master node of an EC2 Spark install (https://spark.apache.org/docs/latest/ec2-scripts.html).

Secondly, how do I launch the main function from RunLSA in the Spark shell?

./spark/bin/spark-shell --jars aas/ch06-lsa/target/ch06-lsa-1.0.0.jar

I have been trying

import com.cloudera.datascience.lsa.RunLSA
RunLSA.main(Array("100","1000","0.1"))

but I get the error

15/05/28 18:14:21 WARN spark.SparkContext: Multiple running SparkContexts detected in the same JVM!
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.(SparkContext.scala:80)

Just looking for your best practice.

Thanks a lot.

chapter 6: termDocMatrix not defined

In chapter 6,

termDocMatrix.cache()

This variable isn't defined earlier in the chapter. I haven't been able to find a suitable way to do this without changing everything to be more like RunLSA, which creates a different set of issues.

Any assistance appreciated.
