jamesbconner / spark-examples

This project forked from googlegenomics/spark-examples

Apache Spark jobs such as Principal Coordinate Analysis.

License: Apache License 2.0

Scala 88.81% Python 11.19%

spark-examples

The projects in this repository demonstrate working with genomic data accessible via the Google Genomics API using Apache Spark.

If you are ready to start coding, take a look at the information below. But if you are looking for a task-oriented list (e.g., How do I compute principal coordinate analysis with Google Genomics?), a better place to start is the Google Genomics Cookbook.

Getting Started

  1. git clone this repository.

  2. If you have not already done so, follow the Google Genomics getting started instructions to set up your environment, including installing gcloud and running gcloud init.

  3. Download and install Apache Spark.

  4. Install SBT.

  5. This project now includes code for calling the Genomics API using gRPC. To use gRPC, you'll need a version of ALPN that matches your JRE version.

       1. See the ALPN documentation for a table of which ALPN jar to use for your JRE version.

       2. Then download the correct version from here.
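
The ALPN table is keyed by the exact JRE version, so it helps to confirm which JRE is on your PATH before picking a jar. The following is a minimal shell sketch, assuming java prints its usual version "..." banner. parse_java_version is a hypothetical helper name and is not part of this project.

```shell
# Hypothetical helper: pull the quoted version string out of the
# `java -version` banner text read from stdin.
parse_java_version() {
  sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# `java -version` writes its banner to stderr, so redirect it into the pipe.
if command -v java >/dev/null 2>&1; then
  echo "JRE version: $(java -version 2>&1 | parse_java_version)"
else
  echo "java not found on PATH" >&2
fi
```

Look up the printed version in the ALPN table to find the matching alpn-boot jar.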

Local Run

From the spark-examples directory, run sbt run.

Use the following flags to match your runtime configuration:

$ export SBT_OPTS='-Xbootclasspath/p:/YOUR/PATH/TO/alpn-boot-YOUR-VERSION.jar'
$ sbt "run --help"
  -o, --output-path  <arg>
  -s, --spark-master  <arg>      A spark master URL. Leave empty if using spark-submit.
  ...
      --help                     Show help message

For example:

$ sbt "run --spark-master local[4]"

A menu should appear asking you to pick the sample to run:

Multiple main classes detected, select one to run:

 [1] com.google.cloud.genomics.spark.examples.SearchVariantsExampleKlotho
 [2] com.google.cloud.genomics.spark.examples.SearchVariantsExampleBRCA1
 [3] com.google.cloud.genomics.spark.examples.SearchReadsExample1
 [4] com.google.cloud.genomics.spark.examples.SearchReadsExample2
 [5] com.google.cloud.genomics.spark.examples.SearchReadsExample3
 [6] com.google.cloud.genomics.spark.examples.SearchReadsExample4
 [7] com.google.cloud.genomics.spark.examples.VariantsPcaDriver
 
Enter number:

Troubleshooting:

If you are seeing java.lang.OutOfMemoryError: PermGen space errors, set the following SBT_OPTS flag:

export SBT_OPTS='-XX:MaxPermSize=256m'
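
Note that this export replaces whatever SBT_OPTS already holds, including the -Xbootclasspath/p flag from the Local Run example. If you need both flags, one option is to append instead of overwrite. The following is a sketch only: append_sbt_opt is a hypothetical helper, and the jar path is the same placeholder used earlier, not a real path.

```shell
# Hypothetical helper: append one JVM flag to SBT_OPTS, keeping any
# flags that are already set (separated by a single space).
append_sbt_opt() {
  SBT_OPTS="${SBT_OPTS:+$SBT_OPTS }$1"
  export SBT_OPTS
}

# Placeholder path: substitute the alpn-boot jar that matches your JRE.
append_sbt_opt '-Xbootclasspath/p:/YOUR/PATH/TO/alpn-boot-YOUR-VERSION.jar'
append_sbt_opt '-XX:MaxPermSize=256m'

# Show the combined value that sbt will pick up.
echo "$SBT_OPTS"
```

The PermGen flag only matters on Java 7 and earlier. Java 8 removed PermGen and ignores -XX:MaxPermSize with a warning.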

Run on Google Compute Engine

(1) Build the assembly.

sbt assembly

(2) Deploy your Spark cluster using Google Cloud Dataproc.

gcloud beta dataproc clusters create example-cluster --scopes cloud-platform

(3) Copy the assembly jar to the master node.

gcloud compute copy-files \
  target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar  example-cluster-m:~/

(4) ssh to the master.

gcloud compute ssh example-cluster-m

(5) Run one of the examples.

spark-submit --class com.google.cloud.genomics.spark.examples.SearchReadsExample1 \
  googlegenomics-spark-examples-assembly-1.0.jar

Running PCA variant analysis on GCE

To run the variant PCA analysis on GCE, make sure you have followed all the steps in the previous section and that you are able to run at least one of the examples.

Run the example PCA analysis for BRCA1 on the 1000 Genomes Project dataset.

spark-submit --class com.google.cloud.genomics.spark.examples.VariantsPcaDriver \
  googlegenomics-spark-examples-assembly-1.0.jar

The analysis will output the two principal components for each sample to the console. Here is an example of the last few lines.

...
NA20811		0.0286308791579312	-0.008456233951873527
NA20812		0.030970386921818943	-0.006755469223823698
NA20813		0.03080348019961635	-0.007475822860939408
NA20814		0.02865238920148145	-0.008084003476919057
NA20815		0.028798695736608034	-0.003755789964021788
NA20816		0.026104805529612096	-0.010430718823329282
NA20818		-0.033609576645005836	-0.026655905606186293
NA20819		0.032019557126552155	-0.00775750983842731
NA20826		0.03026607917284046	-0.009102704080927001
NA20828		-0.03412964005321165	-0.025991697661590686
NA21313		-0.03401702847363714	-0.024555217139987182
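
Each result line is a sample ID followed by two whitespace-separated values, so the console output can be filtered into a clean tab-separated file for plotting. The following is a minimal sketch, assuming the result lines look like the ones above and that any other console output (logs, the "..." marker) should be dropped. pca_to_tsv is a hypothetical helper name.

```shell
# Hypothetical helper: keep only lines of the form "<sample> <pc1> <pc2>"
# where both components are numeric, and emit them as 3-column TSV.
pca_to_tsv() {
  awk 'NF == 3 &&
       $2 ~ /^-?[0-9.]+([eE]-?[0-9]+)?$/ &&
       $3 ~ /^-?[0-9.]+([eE]-?[0-9]+)?$/ {
         printf "%s\t%s\t%s\n", $1, $2, $3
       }'
}

# Demo on two of the lines shown above, plus the elision marker,
# which the filter drops.
printf '...\nNA20811\t\t0.0286308791579312\t-0.008456233951873527\nNA20818\t\t-0.033609576645005836\t-0.026655905606186293\n' | pca_to_tsv
```

On the master node, the same filter could be attached to the spark-submit command with a pipe, redirecting the result to a file of your choice.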

This pipeline is described in greater detail in How do I compute principal coordinate analysis with Google Genomics?

Debugging

For more information, see https://cloud.google.com/dataproc/faq

Contributors

calbach, cassiedoll, deflaux, elmer-garduno, garrickevans, namn, ttriche
