stephaniehicks / benchmark-kmeans

Repository to benchmark k-means on HDF5 files using (1) scikit-learn in Python and (2) BiocSklearn in R/Bioconductor

Jupyter Notebook 22.49% R 0.04% HTML 77.47%

benchmark-kmeans's Introduction

Hi there 👋

🌱 I work at the intersection of data science and health. Specifically, I develop scalable methods and open-source software for biomedical data analysis leading to an improved understanding of human health and disease. I work to make these spaces more welcoming, diverse, and inclusive.

👩‍💻 I am an Associate Professor of Biostatistics at Johns Hopkins and a faculty member of the Johns Hopkins Data Science Lab, with affiliations with the Center for Computational Biology, the Department of Genetic Medicine, and the Department of Biochemistry and Molecular Biology.

🎧 I’m also a co-host of The Corresponding Author podcast, a member of the Editorial Board for Genome Biology, an Associate Editor for Reproducibility at the Journal of the American Statistical Association, and a co-founder of R-Ladies Baltimore.

👩‍🏫 I have received several awards for my work, including the NIH K99/R00 Pathway to Independence Award, the Teaching in the Health Sciences Young Investigator Award, and the COPSS Leadership Academy from the American Statistical Association (ASA), arguably the statistical profession’s most prestigious award for early career leaders in Statistics and Data Science.

benchmark-kmeans's Issues

Assignment for July 16

Shenzhi's assignment for July 16

To work with Python 2.7 on JHPCE, use module load python/2.7.9.

Modify the Python code in this Python notebook for the following tasks:

At the beginning of the file, write Python code to create an HDF5 file out of the iris dataset.
Write Python code to save and read in the HDF5 file. This isn't necessary, but more for my own understanding of how to read and write HDF5 files in Python. Save the file chunked by rows (in this case the observations are the different iris flowers along rows; for gene expression data, we'll want to chunk along columns).
Change the code that compares KMeans() and MiniBatchKMeans() on the Iris dataset so that it uses a dataset loaded from an HDF5 file.
Run both clustering methods on the data set. Try different batch sizes in MiniBatchKMeans().
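
The steps above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: it assumes h5py and scikit-learn are installed, and the file name iris.h5, the dataset name "data", the chunk size of 10 rows, and the three batch sizes are all arbitrary choices made for the example.

```python
import os
import tempfile

import h5py
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import load_iris

# Iris: 150 flowers (rows) by 4 measurements (columns).
X = load_iris().data

# Write the matrix to an HDF5 file, chunked along rows so that a block of
# observations can be read without touching the rest of the file.
# The file and dataset names are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "iris.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=X, chunks=(10, X.shape[1]))

# Read it back. Slicing with [:] copies the whole dataset into memory
# as a NumPy array; a smaller slice would read only the chunks it needs.
with h5py.File(path, "r") as f:
    chunk_shape = f["data"].chunks
    X_h5 = f["data"][:]

# Full-batch k-means on the data loaded from HDF5.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_h5)
print("KMeans inertia:", round(kmeans.inertia_, 2))

# Mini-batch k-means with a few different batch sizes.
inertias = {}
for batch_size in (10, 50, 150):
    mbk = MiniBatchKMeans(
        n_clusters=3, batch_size=batch_size, n_init=10, random_state=0
    ).fit(X_h5)
    inertias[batch_size] = mbk.inertia_
    print("MiniBatchKMeans batch_size", batch_size, "inertia:", round(mbk.inertia_, 2))
```

Lower inertia means tighter clusters, so comparing the printed values gives a first impression of how much accuracy each batch size gives up relative to full k-means.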

Next, we want to try doing this in R using BiocSklearn. This file should help you get set up with BiocSklearn on JHPCE.

Install reticulate and BiocSklearn R packages on cluster.
Create a new R Markdown file for our analysis of the iris dataset.
Create a Python code chunk in the R Markdown with the Python code above.

2018-07-09 tasks

Shenzhi's assignment for the week of July 9

Sign up for Slack and join HicksLab Slack team
Learn about the structure of HDF5 files
Learn about how to use version control (git) and GitHub
Make a GitHub account and become author on repository
Read about pre-processing (quality control) and normalization of single-cell gene expression data
Read about dimensionality reduction methods (PCA and t-SNE)
Read in a single-cell gene expression data set into Python, apply quality control, apply both dimensionality reduction methods to visualize data in two dimensions, apply k-means to gene expression data, visualize results

Assignment for July 13

Shenzhi's assignment for July 13

Use the code in this R Markdown to open this dataset in R
Read about how to pre-process single cell gene expression data using the scater Bioconductor package by working through the quality control tutorial and normalization on the webpage.
Apply quality control metrics to dataset
Save the preprocessed and normalized count table as a CSV file and open in Python
Run t-SNE and PCA on dataset. Remember, you must remove the mean for each gene before running PCA.
Try running k-means on the top 5 PCs.
Save dataset as an HDF5 file and learn about how Python loads the dataset (does it load it all into memory?)
Compare KMeans to MiniBatchKMeans in Python (see this tutorial).
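
The centering, PCA, and k-means steps in this list might look like the sketch below. It uses a synthetic matrix as a stand-in for the normalized count table, assumes cells are rows and genes are columns, and picks two clusters only because the made-up data has two groups.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for a normalized expression matrix:
# 60 cells (rows) by 20 genes (columns). The second half of the cells
# is shifted in the first 5 genes to create two made-up groups.
X = rng.normal(size=(60, 20))
X[30:, :5] += 4.0

# Remove the mean of each gene (column) before PCA.
X_centered = X - X.mean(axis=0)

# PCA via SVD of the centered matrix: the rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project the cells onto the top 5 principal components.
top5 = X_centered @ Vt[:5].T

# Run k-means on the 5-dimensional PC scores instead of all genes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(top5)
print(top5.shape, np.bincount(labels))
```

Note that scikit-learn's PCA class centers each feature itself; the explicit subtraction matters when the decomposition is computed directly, as it is here.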

Assignment for week of 7/30 - 8/3

MiniBatchKMeans:

Figure out how MiniBatchKMeans reads data
Update code so that clusters have consistent colors
Add wider range of batch sizes
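
One possible way to keep cluster colors consistent across runs is to relabel the clusters in a fixed order before plotting. The sketch below sorts clusters by the first coordinate of their centers. The helper name relabel_by_center is hypothetical, and the heuristic assumes the centers differ clearly along that coordinate.

```python
import numpy as np


def relabel_by_center(labels, centers):
    """Renumber clusters so that label 0 has the smallest first center coordinate."""
    labels = np.asarray(labels)
    centers = np.asarray(centers)
    order = np.argsort(centers[:, 0])        # old labels, sorted by first coordinate
    mapping = np.empty_like(order)
    mapping[order] = np.arange(len(order))   # mapping[old_label] = new_label
    return mapping[labels], centers[order]


# Two made-up runs that found the same clusters but numbered them differently.
centers_a = np.array([[5.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
labels_a = np.array([0, 1, 2, 1])
centers_b = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])
labels_b = np.array([2, 0, 1, 0])

new_a, sorted_a = relabel_by_center(labels_a, centers_a)
new_b, sorted_b = relabel_by_center(labels_b, centers_b)
print(new_a, new_b)
```

After relabeling, the same label (and so the same color in a plot) refers to the same cluster in both runs. With fitted scikit-learn models, the inputs would be the labels_ and cluster_centers_ attributes.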

TSNE:

Run TSNE on mouse data
Test multiple perplexities
Plot mouse data by ground truth
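
Testing several perplexities could be sketched as below. The mouse data is not available here, so the iris data stands in for it, and the three perplexity values are arbitrary. The ground-truth labels are kept alongside the embeddings so they can be used to color a plot.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Stand-in data: iris measurements plus the known species labels.
iris = load_iris()
X, truth = iris.data, iris.target

# One 2-D embedding per perplexity value. Perplexity must be smaller
# than the number of samples (150 here).
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)

# To plot by ground truth, color each point of an embedding by `truth`,
# for example with matplotlib's scatter(emb[:, 0], emb[:, 1], c=truth).
```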

BiocSklearn:

Submit issue on https://github.com/vjcitn/BiocSklearn about SklearnEls() only having the sklearn.decomposition package

R/Python KMeans comparison

Assignment for August 6-10

Update to BiocSklearn version 1.3.1.
Using the iris data set, run k-means with the kmeans() function in R and with KMeans() from Python through the BiocSklearn package, so that the comparison is made entirely in R. Create the 3x3 table we discussed in Tuesday's meeting.
Similar to the KMeans() function in Python, write an R function to interact with the MiniBatchKMeans() function in Python using BiocSklearn. Suggestion: Vince says to look at the code in R/skClust.R as a starting place.
Using Iris data set, run mini-batch k-means in R and Python, compare predicted categories in a 3x3 table.
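
The 3x3 comparison table is a cross-tabulation of the cluster labels from the two methods. The assignment builds it in R; the sketch below shows the same idea in Python with made-up label vectors, and the helper name cross_tab is hypothetical.

```python
from collections import Counter


def cross_tab(labels_a, labels_b, k=3):
    """Return a k x k table: entry [i][j] counts points with label i in A and j in B."""
    counts = Counter(zip(labels_a, labels_b))
    return [[counts[(i, j)] for j in range(k)] for i in range(k)]


# Made-up predicted categories for seven observations from two methods.
r_labels = [0, 0, 1, 1, 2, 2, 2]
py_labels = [1, 1, 0, 0, 2, 2, 0]

table = cross_tab(r_labels, py_labels)
for row in table:
    print(row)
```

Cluster numbers are arbitrary, so agreement shows up as one dominant count per row and column rather than as counts on the diagonal. In this example, cluster 0 of the first method corresponds to cluster 1 of the second, and one observation from cluster 2 is assigned differently.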
