lohhla's Introduction

This is a fork of the original LOHHLA repository by Maarten Slagter, PhD student at the Netherlands Cancer Institute. Edits are intended to facilitate usage and interpretation of the program and make it more robust to various problems, while keeping the core functionality intact.

Currently, the major additions include:

  • GRCh38 support. Do check your reference genome for presence of any additional alternative HLA sequences that are not hard-coded in the source code and please let me know if you find any. Perhaps these should be made user-definable.
  • Removed the silent assumption of paired-end sequencing data; single-end sequencing is now also supported.
  • Included the ability to run LOHHLA on partial/sliced bam files, which contain only (or at least) the HLA region of chromosome 6, by allowing the user to input the number of mapped reads for the tumor and normal .bam files. These two quantities are essential to correctly compare the coverage between the two samples. I myself have used this feature in combination with the TCGA bam slicing tool in order to estimate HLA LOH for TCGA samples without having to download the entire .bam files for each patient. For this analysis, I estimated the total mapped reads using the file size of the .bam file, which is directly accessible via TCGA.
  • General usability, e.g. the ability to explicitly define the tumor and matched normal .bam files. The original expects as input a folder with exactly two .bam files: the user defines the normal bam and the script infers that the other one must be the tumor bam. I decided to change this: the user can now define both the normal and tumor bams, allowing any kind of source file organization.
  • More error checks and robustness -- although there's probably still room for improvement here
  • More informative tabular output. The added 'message' column will display why certain alleles may have failed, so you don't necessarily have to go through the log files in case a sample/allele fails (due to e.g. homozygosity or lack of coverage in the matched normal bam)
  • Solved some potential numerical problems in the code, e.g. I encountered a division between two vectors of potentially unequal sizes, which led to an error for some samples
  • Code readability and organization
  • Some additional dependencies were included
  • Plotting code has been minimally altered and remains untested in conjunction with the rest of this codebase
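
The partial-BAM workflow described above needs the total number of mapped reads per sample. The author estimated this from file size. As an alternative sketch, assuming an indexed copy of the full BAM is reachable somewhere (for example on the machine that produced the slice), the count can be taken from `samtools idxstats`, whose third column is the number of mapped read segments per reference. The helper name `sum_mapped_reads` and the file name are hypothetical.

```shell
# Sum the mapped-read column (3rd field) of `samtools idxstats` output on stdin.
sum_mapped_reads() {
  # %.0f avoids integer overflow in awk implementations with 32-bit %d
  awk -F '\t' '{ total += $3 } END { printf "%.0f\n", total }'
}

# Intended use (assumes samtools is installed and the BAM is indexed;
# full_sample.bam is a placeholder):
#   samtools idxstats full_sample.bam | sum_mapped_reads
```

The resulting number is what would be passed to LOHHLA for that sample. The corresponding option names are listed by the script's `--help` output.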

A thank you goes out to Joris van der Haar for spotting some bugs I (Maarten Slagter) introduced. Tested and developed in R 3.5 on Linux.

README

Immune evasion is a hallmark of cancer. Losing the ability to present productive tumor neoantigens could facilitate evasion from immune predation. An integral part of neoantigen presentation is the HLA class I molecule, which presents epitopes to T-cells on the cell surface. Thus, loss of an HLA allele, resulting in HLA homozygosity, may be a mechanism of immune escape. However, the polymorphic nature of the HLA locus precludes accurate copy number calling using conventional copy number tools.

Here, we present LOHHLA, Loss Of Heterozygosity in Human Leukocyte Antigen, a computational tool to evaluate HLA loss using next-generation sequencing data.

LICENCE

LOHHLA IS PROTECTED BY COPYRIGHT AND IS SUBJECT TO A PATENT APPLICATION. THE TOOL IS PROVIDED “AS IS” FOR INTERNAL NON-COMMERCIAL ACADEMIC RESEARCH PURPOSES ONLY. NO RESPONSIBILITY IS ACCEPTED FOR ANY LIABILITY ARISING FROM SUCH USE BY ANY THIRD PARTY.
COMMERCIAL USE OF THIS TOOL FOR ANY PURPOSE IS NOT PERMITTED. ALL COMMERCIAL USE OF THE TOOL INCLUDING TRANSFER TO A COMMERCIAL THIRD PARTY OR USE ON BEHALF OF A COMMERCIAL THIRD PARTY (INCLUDING BUT NOT LIMITED TO USE AS PART OF A SERVICE SUPPLIED TO ANY THIRD PARTY FOR FINANCIAL REWARD) REQUIRES A LICENSE. FOR FURTHER INFORMATION PLEASE EMAIL Eileen Clark [email protected].

What do I need to install to run LOHHLA?

Please ensure a number of dependencies are first installed. These include:

Within R, the following packages are required:

If these packages are not available locally, LOHHLA will attempt to install them.

LOHHLA also requires an HLA fasta file. This can be obtained from Polysolver (http://archive.broadinstitute.org/cancer/cga/polysolver).

How do I install LOHHLA?

To install LOHHLA, simply clone the repository:

git clone https://bitbucket.org/mcgranahanlab/lohhla.git

How do I run LOHHLA?

LOHHLA is coded in R, and can be executed from the command line (Terminal, in Linux/UNIX/OSX, or Command Prompt in MS Windows) directly, or using a shell script (see example below).

USAGE:

Rscript /location/of/LOHHLA/script  [OPTIONS]

For a description of all the options, run:

Rscript /location/of/LOHHLA/script --help
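
A minimal wrapper sketch for the usage pattern above. The option names (`--patientId`, `--normalBAMfile`, `--tumorBAMfile`, `--outputDir`) are taken from the commands quoted in the issue reports further down this page, not from the option list itself, so confirm them with `--help` before relying on them. The function name `lohhla_cmd` and all paths are placeholders.

```shell
# Location of the LOHHLA script; placeholder path, override via the environment.
LOHHLA_SCRIPT="${LOHHLA_SCRIPT:-/location/of/LOHHLA/script}"

# Print (not run) a LOHHLA command line so it can be inspected first.
# usage: lohhla_cmd PATIENT_ID NORMAL_BAM TUMOR_BAM OUTPUT_DIR
lohhla_cmd() {
  printf 'Rscript %s --patientId %s --normalBAMfile %s --tumorBAMfile %s --outputDir %s\n' \
    "$LOHHLA_SCRIPT" "$1" "$2" "$3" "$4"
}

# Dry run with placeholder paths:
#   lohhla_cmd test /data/normal.bam /data/tumor.bam /data/out
```

Printing the command rather than executing it keeps the sketch free of side effects. A real run needs further options, such as the HLA fasta location and the aligner directories.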

What is the output of LOHHLA?

LOHHLA produces multiple different files (see correct-example-out for an example). To determine HLA LOH in a given sample, the most relevant output is the file which ends '.HLAlossPrediction_CI.xls'. The most relevant columns are:

HLA_A_type1 - the identity of allele 1
HLA_A_type2 - the identity of allele 2
PVal_unique - this is a p-value relating to allelic imbalance
LossAllele - this corresponds to the HLA allele that is subject to loss
KeptAllele - this corresponds to the HLA allele that is not subject to loss
HLA_type1copyNum_withBAFBin - the estimated raw copy number of HLA (allele 1)
HLA_type2copyNum_withBAFBin - the estimated raw copy number of HLA (allele 2)

A full definition of the columns is given below. Each entry indicates whether the column should be used [use] or can be ignored [legacy]:

region								 - the region or tumor sample [use]
HLA_A_type1							 - the identity of allele 1 [use]
HLA_A_type2							 - the identity of allele 2 [use]
HLAtype1Log2MedianCoverage	         - the median LogR coverage across allele 1 [use] 
HLAtype2Log2MedianCoverage	         - the median LogR coverage across allele 2 [use]
HLAtype1Log2MedianCoverageAtSites	 - the median LogR coverage across allele 1, restricted to mismatch sites [use]
HLAtype2Log2MedianCoverageAtSites	 - the median LogR coverage across allele 2, restricted to mismatch sites [use]
HLA_type1copyNum_withoutBAF	         - estimated copy number of allele 1, without using BAF [legacy] 
HLA_type1copyNum_withoutBAF_lower	 - lower 95% confidence interval of estimated copy number of allele 1, without using BAF [legacy] 
HLA_type1copyNum_withoutBAF_upper	 - upper 95% confidence interval of estimated copy number of allele 1, without using BAF [legacy] 
HLA_type1copyNum_withBAF	         - estimated copy number of allele 1 using BAF, without binning sites [legacy] 
HLA_type1copyNum_withBAF_lower	     - lower 95% confidence interval of estimated copy number of allele 1 using BAF, without binning sites [legacy] 
HLA_type1copyNum_withBAF_upper	     - upper 95% confidence interval of estimated copy number of allele 1 using BAF, without binning sites [legacy] 
HLA_type2copyNum_withoutBAF	         - estimated copy number of allele 2 without using BAF  [legacy] 
HLA_type2copyNum_withoutBAF_lower	 - lower 95% confidence interval of estimated copy number of allele 2, without using BAF [legacy] 
HLA_type2copyNum_withoutBAF_upper	 - upper 95% confidence interval of estimated copy number of allele 2, without using BAF [legacy] 
HLA_type2copyNum_withBAF	         - estimated copy number of allele 2 using BAF, without binning sites [legacy] 
HLA_type2copyNum_withBAF_lower	     - lower 95% confidence interval of estimated copy number of allele 2 using BAF, without binning sites [legacy] 
HLA_type2copyNum_withBAF_upper	     - upper 95% confidence interval of estimated copy number of allele 2 using BAF, without binning sites [legacy] 
HLA_type1copyNum_withoutBAFBin	     - estimated copy number of allele 1 using binning, but without BAF [legacy]  
HLA_type1copyNum_withoutBAFBin_lower - lower 95% confidence interval of estimated copy number of allele 1 using binning, but without BAF [legacy]  	
HLA_type1copyNum_withoutBAFBin_upper - upper 95% confidence interval of estimated copy number of allele 1 using binning, but without BAF [legacy] 	
HLA_type1copyNum_withBAFBin	         - estimated copy number of allele 1 using binning and BAF [use] 
HLA_type1copyNum_withBAFBin_lower	 - lower 95% confidence interval of estimated copy number of allele 1 using binning and BAF [use] 
HLA_type1copyNum_withBAFBin_upper	 - upper 95% confidence interval of estimated copy number of allele 1 using binning and BAF [use]  
HLA_type2copyNum_withoutBAFBin	     - estimated copy number of allele 2 using binning, but without BAF [legacy]  
HLA_type2copyNum_withoutBAFBin_lower - lower 95% confidence interval of estimated copy number of allele 2 using binning, but without BAF [legacy]	
HLA_type2copyNum_withoutBAFBin_upper - upper 95% confidence interval of estimated copy number of allele 2 using binning, but without BAF [legacy] 	
HLA_type2copyNum_withBAFBin	         - estimated copy number of allele 2 using binning and BAF [use] 
HLA_type2copyNum_withBAFBin_lower	 - lower 95% confidence interval of estimated copy number of allele 2 using binning and BAF [use] 
HLA_type2copyNum_withBAFBin_upper	 - upper 95% confidence interval of estimated copy number of allele 2 using binning and BAF [use]
PVal                                 - p-value relating to difference in logR between allele 1 and allele 2 (paired t-test) [legacy]
UnPairedPval	                     - p-value relating to difference in logR between allele 1 and allele 2 (unpaired t-test) [legacy]
PVal_unique	                         - p-value relating to difference in logR between allele 1 and allele 2, ensuring each read only contributes once (paired t-test) [use]
UnPairedPval_unique                  - p-value relating to difference in logR between allele 1 and allele 2, ensuring each read only contributes once (unpaired t-test) [use]
LossAllele	                         - HLA allele that is present at lower frequency (potentially subject to loss) [use]
KeptAllele                           - HLA allele that is present at higher frequency (potentially not subject to loss) [use]
numMisMatchSitesCov                  - number of mismatch sites with sufficient coverage [use]
propSupportiveSites                  - proportion of mismatch sites that are consistent with loss or allelic imbalance [use]
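
To pull only the [use] columns out of the prediction file, a header-based selection can be sketched as follows. This assumes the '.xls' output is really a tab-delimited text file with a header row, which should be checked on an actual output file. It also allows for a header that is one field shorter than the data rows, as happens when R's `write.table` writes row names. The function name `select_columns` is hypothetical.

```shell
# Print selected columns, chosen by header name, from a tab-delimited file.
# usage: select_columns FILE COL1,COL2,...
select_columns() {
  awk -F '\t' -v OFS='\t' -v want="$2" '
    NR == 1 {
      n = split(want, names, ",")
      for (i = 1; i <= NF; i++) pos[$i] = i          # header name -> field index
      for (j = 1; j <= n; j++) {
        if (!(names[j] in pos)) {
          print "missing column: " names[j] > "/dev/stderr"
          exit 1
        }
        idx[j] = pos[names[j]]
      }
      hdr_nf = NF
      line = names[1]
      for (j = 2; j <= n; j++) line = line OFS names[j]
      print line
      next
    }
    {
      # Shift by one if the data rows carry an extra leading row-name field.
      off = (NF == hdr_nf + 1) ? 1 : 0
      line = $(idx[1] + off)
      for (j = 2; j <= n; j++) line = line OFS $(idx[j] + off)
      print line
    }
  ' "$1"
}
```

For example, `select_columns sample.HLAlossPrediction_CI.xls region,LossAllele,KeptAllele,PVal_unique` (with a placeholder file name) would print those four columns in the requested order.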

How can I test if LOHHLA is working?

Example data is included in the LOHHLA repository. To run LOHHLA on the example dataset, alter the "example.sh" script to match your local file structure and ensure the requisite dependencies are available / loaded. The --HLAfastaLoc, --gatkDir, and --novoDir file paths should also be updated to the corresponding locations. File paths must be full paths. Run "example.sh" and the output should match that found in the correct-example-out directory provided. All BAM files (normal and tumour) should be found in or linked to the same directory.
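
The two path requirements above (full paths, and all BAM files in one directory) can be checked before launching a run. The following is a small sketch. The helper names `is_full_path` and `same_dir` are hypothetical, and the directory comparison is purely textual, so symlinked directories are not resolved.

```shell
# Succeed if the argument is an absolute (full) path.
is_full_path() {
  case "$1" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Succeed if both files are given with the same parent directory.
same_dir() {
  [ "$(dirname "$1")" = "$(dirname "$2")" ]
}

# Intended use (placeholder paths):
#   is_full_path /data/bam/normal.bam && same_dir /data/bam/normal.bam /data/bam/tumor.bam
```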

Who do I talk to?

If you have any issues with LOHHLA, please send an email to [email protected]

How do I cite LOHHLA?

If you use LOHHLA in your research, please cite the following paper:

McGranahan et al., Allele-Specific HLA Loss and Immune Escape in Lung Cancer Evolution, Cell (2017), https://doi.org/10.1016/j.cell.2017.10.001

lohhla's People

Contributors

mcgran01, nmcgranahan, raerose01, slagtermaarten

lohhla's Issues

Issue of running LOHHLA with example data

Hello,

LOHHLA is a wonderful tool!
I ran into an issue with running LOHHLA with example data.

My command:
Rscript LOHHLAscript.R
--LOHHLA_loc=/home/shpc_10066/lohhla/
--patientId test
--outputDir /home/shpc_10066/lohhla/example-file/output
--normalBAMfile /home/shpc_10066/lohhla/example-file/bam/example_BS_GL_sorted.bam
--tumorBAMfile /home/shpc_10066/lohhla/example-file/bam/example_tumor_sorted.bam
--BAMDir /home/shpc_10066/lohhla/example-file/bam/
--hlaPath /home/shpc_10066/lohhla/example-file/hlas
--HLAfastaLoc /home/shpc_10066/lohhla/abc_complete.fasta
--CopyNumLoc /home/shpc_10066/lohhla/example-file/solutions.txt
--mappingStep TRUE
--minCoverageFilter 10
--fishingStep TRUE
--cleanUp TRUE
--gatkDir /home/shpc_10066/anaconda3/envs/lohhla/share/picard-2.5.0-2/
--novoDir /home/shpc_10066/anaconda3/pkgs/novoalign-3.07.00-0/bin/
Output was listed in the record file.
record.txt

I'm really at a loss as to how to proceed, and any guidance would be much appreciated!
Thank you for your kind help!

Different results between the slagtermaarten/LOHHLA and mcgranahanlab/lohhla

Hi, I ran the example data with slagtermaarten/LOHHLA and mcgranahanlab/lohhla, and got different results: PVal_unique=1.01E-24 and UnPairedPval_unique=1.40E-27 from mcgranahanlab/lohhla, but PVal_unique=0.32 and UnPairedPval_unique=0.72 from slagtermaarten/LOHHLA. Both also differ from the results in correct-example-out. Do you have any idea why?

suggestions for the script

Hi! Thanks for your fork, which provides the option for hg38! But I think there are some bugs in your latest script.

I think lines 627 and 628 should be removed; otherwise, the --plottingStep option cannot be used if the output directory does not exist.

if (plottingStep)
stopifnot(dir.exists(figureDir))

Also, in the plotting code, some parameter names are truncated, which leaves the resulting figures incomplete. (This affects all plots where xlab = 'HLA genomic position'.)
For example:
plot(c(1:max(HLA_A_type1normal$V2, HLA_A_type2normal$V2))
, lim = c(0, max(HLA_A_type1normalCov, HLA_A_type2normalCov)), col ='#3182bd99', pch = 16
, lab = 'HLA genomic position'
, lab = 'Coverage'
, ain = c(paste("HLA normal coverage", sample))
, ex = 0.75
, ype = 'n'
, as = 1)

error of LOHHLA

Dear authors, when running LOHHLA, the following error occurs for some of the samples.
I do not understand the warning "Run with coverageStep completed first".
My command and the warnings are pasted below. What can I do to get the correct result?

Rscript /root/lohhla/LOHHLAscript.R --patientId 1218-1 --outputDir /data/1218-1/lohhla-normal/
--normalBAMfile /data/1218-1/1218-2_BS_GL.bam
--BAMDir /data/1218-1/
--hlaPath /data/hla_format/1218-2.hla_format
--HLAfastaLoc /root/lohhla/data/abc_complete.fasta
--CopyNumLoc /data/purity/1218-1.purity
--mappingStep TRUE --minCoverageFilter 10 --fishingStep TRUE --cleanUp FALSE
--HLAexonLoc /root/lohhla/data/hla.dat
--gatkDir /picard --novoDir /novocraft
--coverageStep TRUE

Warning messages:
1: Run with coverageStep completed first -- skipping 1218-1-hla_a!
2: Run with coverageStep completed first -- skipping 1218-1-hla_b!
3: Run with coverageStep completed first -- skipping 1218-1-hla_c!
Error in is.data.frame(x) : object 'combinedTable' not found
Calls: write.table -> is.data.frame
Execution halted

Thank you very much!

Discrepancy with example run

When running this modified script on the example file, I am getting discrepancies in the 'PVal_unique' and 'UnPairedPval_unique' from the solution in the example-file/correct-example-out/example.10.DNA.HLAlossPrediction_CI.xls file. Specifically, the p-values from the modified run are 0.32 and 0.72, respectively, while the original run has p-values of 8e-26 and 1.5e-28. Notably, the 'PVal' and 'UnPairedPval' columns appear similar (but not identical), as do the other columns.

Have the authors encountered a similar issue? I've attached the list of conda packages I have loaded when running LOHHLA and my output.

Thanks!

conda_packages_lohhla.txt

example.10.DNA.HLAlossPrediction_CI.20200204.txt
