bios-823-2019's Introduction

Syllabus for BIOS823 (2019)

BIOS 823 describes the challenges faced by analysts with the increasing importance of large data sets, and the strategies that have been developed in response to these challenges. The core topics are how to manage data and how to make computation scalable. The data management module covers guidelines for working with open data, and the concepts and practical skills for working with in-memory, relational and NoSQL databases. The scalable computing module focuses on asynchronous, concurrent, parallel and distributed computing, as well as the construction of effective workflows following DevOps practices. Applications to the analysis of structured, semi-structured and unstructured data, especially from biomedical contexts, will be interleaved into the course. The course examples are primarily in Python, and fluency in Python is assumed.

Pre-requisites:

  • Fluency in Python (BIOS821, STA 663 or equivalent)

Course repository is at https://github.com/cliburn/bios-823-2019

Overview and Review

1. Introduction

  • Administration
  • Syllabus
  • Python
  • Data science and healthcare

2. Foundations I (Python functional style)

  • Data pipelines
  • Why functional programming?
  • Use of lambdas and higher-order functions
  • Using toolz to build lazy pipelines
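
As an illustration of the lazy-pipeline idea above, here is a minimal sketch that uses only the standard library. The helper `pipe` is written by hand for this example (toolz ships a function of the same name with a similar calling convention), and `naturals` is a hypothetical data source, not course material.

```python
from functools import reduce
from itertools import islice


def pipe(data, *funcs):
    """Thread data through funcs from left to right."""
    return reduce(lambda acc, f: f(acc), funcs, data)


def naturals():
    """Yield 0, 1, 2, ... without end."""
    n = 0
    while True:
        yield n
        n += 1


# Each stage wraps the previous iterator in a generator, so nothing is
# computed until the final `list` call pulls values through islice.
first_even_squares = pipe(
    naturals(),
    lambda xs: (x * x for x in xs),            # map: square each value
    lambda xs: (x for x in xs if x % 2 == 0),  # filter: keep even squares
    lambda xs: islice(xs, 5),                  # take: stop after five items
    list,                                      # materialize the result
)
print(first_even_squares)  # [0, 4, 16, 36, 64]
```

Because every stage is lazy, the pipeline terminates even though `naturals()` is infinite.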

3. Foundations II (Python data science stack)

  • Using numpy
  • Using scipy
  • Using pandas
  • Using scikit-learn
  • Using statsmodels
  • Using matplotlib
  • Using seaborn

Data storage and retrieval

4. File types for data storage and ETL

  • Delimited text files
  • JSON
  • XML
  • HDF5
  • Avro
  • Parquet
  • APIs for data sharing
  • Using odo

5. SQL database and warehouse schemas

  • Tuples and set operations
  • The database schema
  • Tables and views
  • Tables, rows, columns, cells
  • Primary keys, foreign keys and referential integrity
  • Normalization for data entry
  • Indexing and optimization
  • Database migrations
  • De-normalization for data query
  • Star schema for data warehouses

6. SQL: Creation and manipulation

  • Why learn SQL
  • The stages of data normalization
  • The CREATE statement
  • The INSERT statement
  • The UPDATE statement
  • Adding indexes
  • ACID
  • Transactions and rollback
  • ETL to populate databases
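
The statements listed above can be tried out with Python's built-in `sqlite3` module. The sketch below is only one possible illustration: the `patient` table, its columns and the sample rows are invented for the example, and other database engines may handle transactions differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# CREATE: a table with a primary key and a CHECK constraint (hypothetical schema)
conn.execute(
    "CREATE TABLE patient ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER CHECK (age >= 0))"
)
conn.execute("CREATE INDEX idx_patient_name ON patient (name)")

# INSERT and UPDATE inside one transaction; `with conn` commits on success
with conn:
    conn.executemany(
        "INSERT INTO patient (name, age) VALUES (?, ?)",
        [("Ann", 34), ("Bob", 51)],
    )
    conn.execute("UPDATE patient SET age = ? WHERE name = ?", (35, "Ann"))

# A failing statement rolls back the whole transaction (atomicity)
try:
    with conn:
        conn.execute("INSERT INTO patient (name, age) VALUES (?, ?)", ("Cy", 40))
        conn.execute("INSERT INTO patient (name, age) VALUES (?, ?)", ("Dee", -1))
except sqlite3.IntegrityError:
    pass  # the CHECK constraint rejected age = -1, so "Cy" is undone as well

rows = conn.execute("SELECT name, age FROM patient ORDER BY id").fetchall()
print(rows)  # [('Ann', 35), ('Bob', 51)]
```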

7. SQL: Basic queries and sub-queries

  • Server vs client side queries
  • The SELECT statement
  • Projection
  • Filtering on rows
  • Sorting
  • Transforms
  • Grouping
  • Filtering on groups
  • Summarization
  • Sub-queries
  • Using EXPLAIN
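
Several of the query steps above (projection, filtering on rows, grouping, filtering on groups, sorting and a sub-query) can be combined in a small sketch using `sqlite3`. The `visit` table and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit (patient TEXT, clinic TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO visit VALUES (?, ?, ?)",
    [
        ("Ann", "A", 100.0),
        ("Ann", "B", 300.0),
        ("Bob", "A", 50.0),
        ("Cy", "B", 250.0),
        ("Cy", "B", 100.0),
    ],
)

# WHERE filters rows before grouping; HAVING filters the groups afterwards
per_patient = conn.execute(
    """
    SELECT patient, COUNT(*) AS n_visits, SUM(cost) AS total
    FROM visit
    WHERE cost > 60
    GROUP BY patient
    HAVING COUNT(*) >= 2
    ORDER BY total DESC
    """
).fetchall()
print(per_patient)  # [('Ann', 2, 400.0), ('Cy', 2, 350.0)]

# Sub-query: visits that cost more than the overall average (160.0 here)
above_average = conn.execute(
    """
    SELECT patient, cost
    FROM visit
    WHERE cost > (SELECT AVG(cost) FROM visit)
    ORDER BY cost DESC
    """
).fetchall()
print(above_average)  # [('Ann', 300.0), ('Cy', 250.0)]
```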

8. SQL: Window queries and UDFs

  • Set operations
  • Joins and semi-joins
  • Window functions
  • User-defined functions (1:1, N:1, 1:N)

9. Key-value and document databases

  • Concepts of NoSQL: From ACID to BASE
  • What is a key-value database?
  • Using redis
  • What is a document database?
  • Using mongodb

10. Column-family and graph databases

  • What is a column family database?
  • Using hbase
  • What is a graph database?
  • Using neo4j
  • Trade-offs (when to use SQL, key-value, document, graph and column family)

Midterm I (10%)

Distributed computing

OPTION: Asynchronous computation

  • Concurrent, parallel and distributed
  • Why asynchronous programming?
  • Latency and resource starvation
  • I/O and computation bottlenecks
  • Generators and Coroutines
  • Coroutines and tasks
  • Sending messages to coroutines, threads and processes
  • The event loop
  • async and await
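
To make the `async` and `await` item concrete, here is a minimal sketch with `asyncio`. The coroutine `fetch` is a hypothetical stand-in for an I/O-bound call; `asyncio.sleep` simulates the waiting time.

```python
import asyncio
import time


async def fetch(name, delay):
    """Pretend to wait on I/O; awaiting sleep hands control back to the event loop."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # gather runs the three coroutines concurrently and keeps their order
    return await asyncio.gather(fetch("a", 0.2), fetch("b", 0.2), fetch("c", 0.2))


start = time.perf_counter()
results = asyncio.run(main())  # creates the event loop and runs main to completion
elapsed = time.perf_counter() - start
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

The three waits overlap, so the total time should be close to one delay rather than the sum of all three.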

OPTION: Multi-core computation

  • Amdahl's and Gustafson's laws
  • Threads and processes
  • Embarrassingly parallel problems
  • Shared memory issues
  • Deadlocks and race conditions
  • Low level parallel programming with multiprocessing
  • Using concurrent.futures and multiprocessing pools
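
The two scaling laws named above can be written as short functions. This is a sketch under the usual textbook assumptions: `p` is the fraction of the work that can be parallelized and `n` is the number of workers. In Gustafson's law, `p` is measured on the parallel run and the problem size is assumed to grow with `n`.

```python
def amdahl_speedup(p, n):
    """Speedup for a fixed-size problem: 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p, n):
    """Scaled speedup when the problem grows with n: (1 - p) + p * n."""
    return (1.0 - p) + p * n


# With 90% parallel work on 8 workers, the serial 10% limits the fixed-size speedup
print(round(amdahl_speedup(0.9, 8), 2))     # 4.71
print(round(gustafson_speedup(0.9, 8), 2))  # 7.3
```

Under Amdahl's law the speedup can never exceed `1 / (1 - p)`, which is 10 for `p = 0.9`, no matter how many workers are added.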

11. The Hadoop ecosystem, HDFS and YARN

  • Why distributed computing?
  • Google Map-Reduce
  • Hadoop
  • HDFS: Distributed file system
  • YARN: Resource manager
  • MapReduce: Compute engine

12. MapReduce and other tools

  • MapReduce programming
  • Writing a MapReduce program in Python using Streaming
  • Tools for putting data in HDFS (Flume, Sqoop)
  • Tools for SQL access to HDFS (Hive, Impala)
  • Tools for workflow and pipeline construction (Crunch, Oozie, Airflow)
  • Tools for coordination of distributed programs (Zookeeper)
  • NoSQL database (HBase)
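
The MapReduce programming model can be sketched in plain Python without a cluster. The example below is a single-process imitation of the map, shuffle and reduce phases; the function names and sample lines are chosen for illustration and are not part of Hadoop's API.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine the values for one key into a single count."""
    return key, sum(values)


lines = ["the quick brown fox", "the lazy dog", "The fox"]
pairs = (pair for line in lines for pair in mapper(line))
counts = dict(reducer(key, values) for key, values in shuffle(pairs).items())
print(counts["the"], counts["fox"])  # 3 2
```

In a real cluster the mapper and reducer would run on different machines, and the shuffle would move data between them over the network.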

13. Dask data frames

  • Dask concepts
  • Working with dask DataFrames
  • Dask efficiency

14. Dask arrays, bags and machine learning

  • Working with dask arrays
  • Working with dask bags
  • ML with dask

15. DevOps and cloud computing

  • What is DevOps?
  • Practices and tools
  • Source code control
  • Using Docker containers
  • Walk-through using AWS

16. Spark basics

  • Spark concepts
  • The Spark context
  • The data flow DAG
  • Resilient Distributed Datasets (RDD)
  • Key-value RDDs
  • Creating and saving RDDs
  • Actions and Transforms
  • Caching RDDs
  • Accumulators and Broadcast variables
  • Using UDFs (User Defined Functions)
  • Example: Hello, word count!

17. Spark SQL

  • The Spark session
  • Creating and saving a DataFrame
  • DataFrame operations
  • DataFrame and RDD conversions
  • Using SQL to query a DataFrame
  • Caching a DataFrame
  • Using vectorized UDFs
  • Column family databases
  • Columnar data stores: Arrow and Parquet

18. Spark machine learning

  • Basic statistics with Spark
  • Pipelines
  • Data processing
  • Clustering
  • Classification and regression
  • Collaborative filtering
  • Model selection

19. Spark and streaming data

  • Streaming concepts
  • StreamingContext
  • Discretized Streams
  • Sources of data
  • Transforms
  • Checkpoints
  • DataFrame operations
  • Machine learning operations
  • Processing event logs

Midterm II (10%)

Data analysis

OPTION: Structured data, including geographical

  • Structured data using dask
  • Statistical visualization with seaborn, plotly, bokeh
  • From long/lat to x/y coordinates
  • Interactive mapping with datashader

OPTION: Free text

  • Concepts of text analysis
  • From text to matrix
  • Natural language processing with nltk, spacy
  • Topic modeling with spacy and gensim
  • Sentiment classification

OPTION: Images

  • Concepts of image processing
  • Using scikit-image
  • Using a CNN to classify images

OPTION: Time series

  • Concepts of time series analysis
  • Using statsmodels
  • Using prophet

OPTION: Graphs and networks

  • Concepts of graph and network analysis
  • Using networkx
  • Using neo4j
  • Using Spark GraphFrames

OPTION: Genomics

  • Concepts of genomic processing
  • Unix pipelines
  • Distributed processing with Adam
  • Example: counting k-mers

Final Exam (30%)
