stephaniehicks / benchmark-kmeans

Repository to benchmark k-means on HDF5 files using (1) scikit-learn in Python and (2) BiocSklearn in R/Bioconductor

Jupyter Notebook 22.49% R 0.04% HTML 77.47%

benchmark-kmeans's Introduction

Hi there 👋

🌱 I work at the intersection of data science and health. Specifically, I develop scalable methods and open-source software for biomedical data analysis, leading to an improved understanding of human health and disease. I work to make these spaces more welcoming, diverse, and inclusive.

👩‍💻 I am an Associate Professor of Biostatistics at Johns Hopkins and a faculty member of the Johns Hopkins Data Science Lab, with affiliations with the Center for Computational Biology, the Department of Genetic Medicine, and the Department of Biochemistry and Molecular Biology.

🎧 I'm also a co-host of The Corresponding Author podcast, a member of the Editorial Board for Genome Biology, an Associate Editor for Reproducibility at the Journal of the American Statistical Association, and a co-founder of R-Ladies Baltimore.

👩‍🏫 I have received several awards for my work, including the NIH K99/R00 Pathway to Independence Award, the Teaching in the Health Sciences Young Investigator Award, and the COPSS Leadership Academy from the American Statistical Association (ASA), arguably the statistical profession's most prestigious award for early career leaders in Statistics and Data Science.

benchmark-kmeans's People

Contributors

shenzhis, stephaniehicks

benchmark-kmeans's Issues

Assignment for July 16

Shenzhi's assignment for July 16

To work with Python 2.7 on JHPCE, use module load python/2.7.9.

Modify the Python code in this Python notebook for the following tasks:

  • At the beginning of the file, write Python code to create an HDF5 file out of the iris dataset.
  • Write Python code to save and read in the HDF5 file. This isn't strictly necessary; it is more for my own understanding of how to read and write HDF5 files in Python. Save the file chunked by rows (in this case the observations are the different iris flowers along rows; for gene expression data, we'll want to chunk along columns).
  • Change the code that compares KMeans() and MiniBatchKMeans() on the iris dataset so that it uses a dataset loaded from an HDF5 file.
  • Run both clustering methods on the data set. Try different batch sizes in MiniBatchKMeans().
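
A minimal sketch of the Python tasks above, assuming Python 3 with h5py and scikit-learn installed (the assignment itself targets Python 2.7 on JHPCE). The file name `iris.h5`, the dataset names `data` and `target`, the chunk shape, the batch sizes, and the helper function names are illustrative choices, not taken from the repository.

```python
import os
import tempfile

import h5py
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import load_iris


def write_iris_hdf5(path, rows_per_chunk=10):
    """Write the iris measurements to an HDF5 file, chunked along rows."""
    iris = load_iris()
    with h5py.File(path, "w") as f:
        # Each chunk holds `rows_per_chunk` flowers and all 4 features.
        f.create_dataset("data", data=iris.data,
                         chunks=(rows_per_chunk, iris.data.shape[1]))
        f.create_dataset("target", data=iris.target)


def read_hdf5(path):
    """Read both datasets back into memory as NumPy arrays."""
    with h5py.File(path, "r") as f:
        # Slicing with [:] copies the whole dataset into a NumPy array.
        return f["data"][:], f["target"][:]


def compare_kmeans(X, batch_sizes=(10, 50, 150), n_clusters=3, seed=0):
    """Return the inertia of KMeans and of MiniBatchKMeans per batch size."""
    results = {}
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    results["KMeans"] = km.inertia_
    for b in batch_sizes:
        mbk = MiniBatchKMeans(n_clusters=n_clusters, batch_size=b,
                              n_init=10, random_state=seed).fit(X)
        results["MiniBatchKMeans(batch_size=%d)" % b] = mbk.inertia_
    return results


# Illustrative location: a temporary directory, so nothing is overwritten.
hdf5_path = os.path.join(tempfile.mkdtemp(), "iris.h5")
write_iris_hdf5(hdf5_path)
X, y = read_hdf5(hdf5_path)
inertias = compare_kmeans(X)
for name, inertia in inertias.items():
    print(name, round(inertia, 2))
```

Lower inertia means tighter clusters, so printing it per batch size gives a quick view of how much accuracy the mini-batch variant trades away. For gene expression data with cells in columns, the chunk shape would be flipped so that chunks run along columns.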

Next, we want to try doing this in R using BiocSklearn. This file should help you get set up with BiocSklearn on JHPCE.

  • Install reticulate and BiocSklearn R packages on cluster.
  • Create a new R Markdown file for our analysis of the iris dataset.
  • Create a Python code chunk in the R Markdown with the Python code above.

2018-07-09 tasks

Shenzhi's assignment for the week of July 9

  • Sign up for Slack and join HicksLab Slack team
  • Learn about the structure of HDF5 files
  • Learn about how to use version control (git) and GitHub
  • Make a GitHub account and become author on repository
  • Read about pre-processing (quality control) and normalization of single-cell gene expression data
  • Read about dimensionality reduction methods (PCA and t-SNE)
  • Read a single-cell gene expression data set into Python, apply quality control, apply both dimensionality reduction methods to visualize the data in two dimensions, apply k-means to the gene expression data, and visualize the results

Assignment for July 13

Shenzhi's assignment for July 13

  • Use the code in this R Markdown to open this dataset in R
  • Read about how to pre-process single cell gene expression data using the scater Bioconductor package by working through the quality control tutorial and normalization on the webpage.
  • Apply quality control metrics to dataset
  • Save the preprocessed and normalized count table as a CSV file and open in Python
  • Run t-SNE and PCA on dataset. Remember, you must remove the mean for each gene before running PCA.
  • Try running k-means on the top 5 PCs.
  • Save dataset as an HDF5 file and learn about how Python loads the dataset (does it load it all into memory?)
  • Compare KMeans to MiniBatchKMeans in Python (see this tutorial).
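
The PCA and k-means steps above can be sketched as follows, assuming NumPy and scikit-learn. The expression matrix here is random data standing in for the preprocessed count table (cells in rows, genes in columns); its size, the two artificial groups, and the choice of two clusters are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in for a normalized expression matrix: 60 cells x 200 genes.
expr = rng.normal(size=(60, 200))
expr[:30, :20] += 3.0  # shift 20 genes in the first 30 cells to create two groups

# Remove the mean of each gene (column) before PCA.
centered = expr - expr.mean(axis=0)

# PCA via SVD of the centered matrix; rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

n_pcs = 5
pc_scores = U[:, :n_pcs] * S[:n_pcs]  # coordinates of each cell on the top 5 PCs

# Run k-means on the top 5 PCs instead of on all genes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pc_scores)
print(pc_scores.shape, np.bincount(labels))
```

Note that scikit-learn's own `PCA` class centers the data internally, so the explicit subtraction matters mainly when PCA is computed by hand, as here.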

Assignment for week of 7/30 - 8/3

MiniBatchKMeans:

  • Figure out how MiniBatchKMeans reads data
  • Update code so that clusters have consistent colors
  • Add wider range of batch sizes
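
One way to keep colors consistent across the two clusterings is to relabel the MiniBatchKMeans clusters so that each takes the label of its nearest KMeans center. A minimal sketch with NumPy, using made-up centers and labels; `match_labels` is a hypothetical helper, and nearest-center matching is not guaranteed to be one-to-one when clusters overlap.

```python
import numpy as np


def match_labels(ref_centers, other_centers, other_labels):
    """Relabel `other_labels` using the nearest reference center."""
    # Pairwise Euclidean distances, shape (n_ref, n_other).
    dists = np.linalg.norm(
        ref_centers[:, None, :] - other_centers[None, :, :], axis=2
    )
    # For each "other" cluster, the index of the closest reference cluster.
    nearest_ref = dists.argmin(axis=0)
    return nearest_ref[other_labels]


# Made-up centers: the same three clusters, numbered differently by each method.
kmeans_centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
minibatch_centers = np.array([[10.1, 0.2], [0.1, -0.1], [4.9, 5.2]])
minibatch_labels = np.array([0, 1, 2, 1])

relabeled = match_labels(kmeans_centers, minibatch_centers, minibatch_labels)
print(relabeled)  # [2 0 1 0]
```

After relabeling, the same color map can be indexed by label for both plots.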

TSNE:

  • Run TSNE on mouse data
  • Test multiple perplexities
  • Plot mouse data by ground truth

BiocSklearn:

  • R/Python KMeans comparison

Assignment for August 6-10

  • Update to BiocSklearn version 1.3.1.
  • Using the iris data set, run k-means with the kmeans() function in R and with KMeans() from Python via the BiocSklearn package, so that the comparison is made entirely in R. Create the 3x3 table we discussed in Tuesday's meeting.
  • Similar to the KMeans() function in Python, write an R function to interact with the MiniBatchKMeans() function in Python using BiocSklearn. Suggestion: Vince says to look at the code in R/skClust.R as a starting point.
  • Using the iris data set, run mini-batch k-means in R and Python, and compare the predicted categories in a 3x3 table.
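
The 3x3 table can be built by counting how often each pair of labels co-occurs. The assignment asks for this in R; the sketch below shows the same counting in Python with toy label vectors (not real iris output), and `crosstab` is a hypothetical helper name.

```python
def crosstab(labels_a, labels_b, k=3):
    """Count co-occurrences of two label vectors in a k x k table."""
    table = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        table[a][b] += 1
    return table


# Toy labels: both methods find the same groups but number them differently.
labels_r = [0, 0, 1, 1, 2, 2]
labels_py = [1, 1, 0, 0, 2, 2]

table = crosstab(labels_r, labels_py)
for row in table:
    print(row)
# [0, 2, 0]
# [2, 0, 0]
# [0, 0, 2]
```

Rows are one method's labels and columns the other's. Because cluster numbering is arbitrary, agreement shows up as a single large count in each row and column rather than necessarily on the diagonal.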
