immunogenomics / schlapers

Code to run the scHLApers pipeline for personalized single-cell HLA quantification

License: GNU General Public License v3.0

R 2.28% Jupyter Notebook 97.01% Shell 0.71%
alignment expression hla single-cell

scHLApers

Code to run the scHLApers pipeline for quantifying single-cell HLA expression using personalized reference genomes (Kang et al., Nature Genetics 2023).

Requirements

The R scripts require the following (listed version or higher):

  • R=4.0.5
  • Biostrings=2.58.0
  • purrr=0.3.4
  • readr=2.1.2
  • stringi=1.7.8
  • stringr=1.4.0
  • tidyverse=1.3.1
  • rtracklayer=1.50.0

Other software:

Data:

  • Reference genome (e.g. GRCh38.primary_assembly.genome.fa): available here
  • Gene annotation file (e.g. gencode.v38.annotation.gtf): available here
  • Cell barcode whitelist: more info here

Pipeline and example data

Each step has its own directory with the necessary scripts and a tutorial walking through the step. The example_data and example_output directories contain example input and output files for 2 samples. The raw scRNA-seq data for the example were obtained from the Yazar et al. (Science 2022) study and are publicly available on GEO (GSE196830).

Input

The inputs to scHLApers are:

  • Raw scRNA-seq data (either FASTQ or BAM format)
  • HLA allele calls (in CSV format, labeled as "SampleX_alleles.csv", see example_data/inputs/alleles for format)

See the HLA analyses tutorial from Sakaue et al. for a protocol for imputing HLA alleles from genotype array data.
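Before running the pipeline on many samples, it can help to confirm that every sample has an allele file following the naming convention above. The following is a minimal sketch, not part of the scHLApers code: the function name `find_allele_files` and its arguments are hypothetical, and it only checks file names, not the CSV contents.

```python
from pathlib import Path


def find_allele_files(alleles_dir, sample_names):
    """Map each sample to its expected '<sample>_alleles.csv' file.

    Returns (found, missing): a dict of sample name -> Path for files
    that exist, and a list of sample names with no allele file.
    """
    alleles_dir = Path(alleles_dir)
    found, missing = {}, []
    for sample in sample_names:
        # Naming convention from the README: "SampleX_alleles.csv"
        path = alleles_dir / f"{sample}_alleles.csv"
        if path.is_file():
            found[sample] = path
        else:
            missing.append(sample)
    return found, missing


if __name__ == "__main__":
    found, missing = find_allele_files(
        "example_data/inputs/alleles",
        ["Sample_1006_1007", "Sample_1050_1051"],
    )
    print("found:", sorted(found))
    print("missing:", missing)
```

Any sample reported as missing would need HLA allele calls before Step 2 can be run for it.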

Step 1: Prepare HLA allelic sequence database

We provide a pre-prepared database generated from IPD-IMGT/HLA version 3.47 that can be used directly in Step 2. Alternatively, you can prepare your own database from the latest IPD-IMGT/HLA version by following the tutorial.

Step 2: Make personalized reference and annotation files

The tutorial demonstrates how to generate personalized contig (FASTA) and annotation (GTF) files, which will be combined with the masked reference, and how to mask the reference.
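To illustrate what masking a reference means in general, the sketch below replaces chosen intervals of a sequence with "N" so that reads can no longer align there. This is an assumption about the general technique, not the tutorial's actual code: the function `mask_intervals` is hypothetical, and the regions to mask and the real implementation are defined in the Step 2 tutorial.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Return `sequence` with each 0-based, half-open [start, end)
    interval replaced by `mask_char`. The length is unchanged, so
    coordinates outside the masked regions stay valid.
    """
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is out of range")
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


if __name__ == "__main__":
    # Toy sequence and made-up coordinates, for illustration only.
    print(mask_intervals("ACGTACGT", [(2, 5)]))
```

Keeping the sequence length fixed means the existing gene annotation coordinates remain valid for everything outside the masked regions.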

Step 3: Quantify single-cell expression with STARsolo

This step provides example scripts for running STARsolo for read alignment and expression quantification in single-cell data. The scripts will need to be modified based on the specifics of your dataset (e.g. UMI length, input format, barcode whitelist path, STAR executable). Please see the STAR manual for all options.
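As a rough guide to which settings are dataset-specific, the sketch below assembles a STARsolo command line as a list of arguments. It is not the repository's script: the helper `build_starsolo_command` and all paths are hypothetical, it covers FASTQ input only, and the choice of `--soloFeatures GeneFull_Ex50pAS` and `--soloMultiMappers EM` is inferred from the example output path (GeneFull_Ex50pAS/raw/UniqueAndMult-EM.mtx). Check every option against the STAR manual and the provided scripts.

```python
import shlex


def build_starsolo_command(genome_dir, read_files, whitelist, out_prefix,
                           umi_len, star_exe="STAR", threads=4, gzipped=True):
    """Build a STARsolo argument list for droplet-based FASTQ input.

    read_files: FASTQ paths in the order STAR expects for --readFilesIn
    (cDNA read first, then the barcode/UMI read).
    """
    cmd = [
        star_exe,
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,          # index built from the personalized reference
        "--readFilesIn", *read_files,
        "--soloType", "CB_UMI_Simple",
        "--soloCBwhitelist", whitelist,     # cell barcode whitelist
        "--soloUMIlen", str(umi_len),       # depends on the chemistry
        "--soloFeatures", "GeneFull_Ex50pAS",
        "--soloMultiMappers", "EM",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]
    if gzipped:
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


if __name__ == "__main__":
    # All paths below are placeholders.
    cmd = build_starsolo_command(
        genome_dir="genome_index/",
        read_files=["sample_R2.fastq.gz", "sample_R1.fastq.gz"],
        whitelist="whitelist.txt",
        out_prefix="results/sample_",
        umi_len=12,
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps each dataset-specific value (UMI length, whitelist path, STAR executable) in one named argument, and the list can be passed directly to `subprocess.run` if desired.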

Outputs

The output of scHLApers is a genes-by-cells expression matrix with improved classical HLA expression estimates. In the example output, we have filtered the raw STARsolo counts matrix (to remove empty droplets) using a provided list of cell barcodes (see example_data/cell_meta_example.csv).

The raw counts matrix output by the pipeline for example Sample_1006_1007 can be found here: ../example_outputs/STARsolo_results/Sample_1006_1007_scHLApers/Sample_1006_1007_scHLApers_Solo.out/GeneFull_Ex50pAS/raw/UniqueAndMult-EM.mtx

A filtered version is located here (read into R using readRDS): ../example_outputs/STARsolo_results/Sample_1006_1007_scHLApers/exp_EM.rds

Note: The classical HLA genes are named IMGT_A, IMGT_C, IMGT_B, IMGT_DRB1, IMGT_DQA1, IMGT_DQB1, IMGT_DPA1, IMGT_DPB1.
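As a quick sanity check on the raw output, the sketch below sums the counts for every gene whose name starts with "IMGT_" in a Matrix Market (.mtx) file. It is a minimal illustration under stated assumptions, not part of the pipeline: the function `total_counts_by_gene` is hypothetical, the matrix is assumed to be in coordinate format with genes as rows, and the gene names are assumed to be supplied in row order (for example, read from the features file that STARsolo writes alongside the matrix).

```python
def total_counts_by_gene(mtx_lines, feature_names, prefix="IMGT_"):
    """Sum matrix entries per gene for genes whose name starts with `prefix`.

    mtx_lines: iterable of lines from a Matrix Market coordinate file
    (rows = genes, columns = cells, 1-based indices).
    feature_names: gene names in row order.
    """
    totals = {name: 0.0 for name in feature_names if name.startswith(prefix)}
    dims_seen = False
    for line in mtx_lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and header/comment lines
        if not dims_seen:
            dims_seen = True  # first data line is "n_rows n_cols n_entries"
            continue
        row, _col, value = line.split()[:3]
        name = feature_names[int(row) - 1]
        if name in totals:
            totals[name] += float(value)
    return totals


if __name__ == "__main__":
    # Tiny made-up matrix: 3 genes x 2 cells.
    toy_mtx = [
        "%%MatrixMarket matrix coordinate real general",
        "3 2 4",
        "1 1 2.5",
        "1 2 1.5",
        "2 1 7",
        "3 2 0.5",
    ]
    print(total_counts_by_gene(toy_mtx, ["IMGT_A", "ACTB", "IMGT_B"]))
```

For a real file, pass an open file handle as `mtx_lines` so the matrix is streamed line by line rather than loaded into memory.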

Support

For questions and assistance not answered in tutorials, you can contact Joyce Kang (joyce_kang AT hms.harvard DOT edu).

Reproducing results from the manuscript

Code to reproduce the figures and analyses from Kang et al. will become available at https://github.com/immunogenomics/hla2023.

schlapers's Issues

Sample_1006_1007_alleles.csv and Sample_1050_1051_alleles.csv

Hi, I am a user of scHLApers. Thank you for developing such an excellent tool. Could you please explain how the files Sample_1006_1007_alleles.csv and Sample_1050_1051_alleles.csv on the GitHub website were obtained? For example, where did the 'count' column come from? Also, why is the first column different in the two files? Are different genes being detected in different samples? Thank you!
