diltheylab / hla-la Goto Github PK

View Code? Open in Web Editor NEW

120.0 12.0 41.0 1.26 MB

Fast HLA type inference from whole-genome data

License: GNU General Public License v3.0

C++ 79.07% Perl 20.06% Shell 0.01% Makefile 0.27% Prolog 0.60%

hla-la's Introduction

HLALA (formerly HLAPRG:LA)

News

(19 July 2023) The T2T reference genome (chm13v2.0.fa) has now been added to the list of supported references.

(10 June 2019) If you work with CRAM files and receive error messages like Unable to fetch reference #5 28687068..28707018, the reason is that samtools view needs to know which reference genome was used to create your CRAM file. Solution: use the switch --samtools_T to specify a suitable reference genome file. Example: ./HLA-LA.pl --BAM NA12878.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram --graph PRG_MHC_GRCh38_withIMGT --sampleID NA12878 --samtools_T /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa. You need a fresh pull from GitHub for this to work, the bioconda version has not been updated yet.

(24 February 2019) HLA*LA can now be installed via bioconda! conda install hla-la!

(07 January 2019) The tool is now called HLA*LA. To preserve backward compatibility, we've created a symlink HLA-LA.pl pointing to HLA-LA.pl.

(21 March 2018) We have added an experimental mode for long-read typing (see below) and support for HLA typing of assemblies (see HLA-ASM.md).

(26 September 2017) I've added an additional B38 reference.

(22 July 2017) There is a now a 316MB "NA12878 mini" CRAM test file for the impatient. See below for the download link.

(18 April 2017) The main executable is now called HLA-LA.pl (instead of inferHLATypes.pl). Also, it is now easier to support a central cluster installation of HLA*LA (see section "Cluster installation" below). Before pulling the update, rename your local copy of the paths.ini file, then do a git pull, and then copy the paths from your local copy into the freshly pulled paths.ini (but be careful not to delete the new workingDir entry in the freshly pulled paths.ini).

(05 April 2017) I've added two new B37 reference files and enabled genotyping for more HLA loci. This is still experimental. In particular, treat calls for HLA-DRB3/4 with caution. The model does not try to estimate copy number for these genes!

(03 April 2017) I've received the first user requests for the inclusion of additional references (UCSC hg19 in this case). Also, the pipeline currently uses an old version of Picard. I plan to update the pipeline to a recent version of Picard, and include additional references in the database as well. Let me know if there are further issues that need to be addressed!

Basics

HLA*LA carries out HLA typing based on a population reference graph and employs a new linear projection method to align reads to the graph. See the publication and this blog post.

HLA*LA is faster and less resource-intensive than HLA*PRG.

Installing HLA*LA

Via bioconda

If you're using bioconda, a simple conda install hla-la should be sufficient to install HLA*LA and all dependencies.

Note that you will still have to download the data packages and index the graph; the required commands will be shown at the end of the bioconda installation process.

Manual installation

Prerequisites

g++ with support for C++11 (e.g. 4.7.2)
Boost >= 1.59
Bamtools (now tested with 2.5.1 -- the makefile also contains instructions for older versions)
libz

bwa >= 0.7.12
samtools >= 1.3
picard

Compilation

Create a directory structure for HLA*LA:

mkdir HLA-LA HLA-LA/bin HLA-LA/src HLA-LA/obj HLA-LA/temp HLA-LA/working HLA-LA/graphs

Clone this repository into HLA-LA/src:

cd HLA-LA/src; git clone https://github.com/DiltheyLab/HLA-LA.git .

Compile: modify the paths to libraries and includes in the makefile and then

make all

Instead of modifying the makefile, you can also specify paths for Boost and bamtools via the command line, like so:

make all BOOST_PATH=/data/projects/phillippy/software/boost_1_60_0 BAMTOOLS_PATH=/data/projects/phillippy/software/bamtools

(This will then use/include the files in $BOOST_PATH/include, $BOOST_PATH/lib, $BAMTOOLS_PATH/include, $BAMTOOLS_PATH/lib and $BAMTOOLS_PATH/src - if your local installations have a different structure, edit the makefile directly.)

If you receive error messages, see the next section ("Debugging issues with bamtools and boost").

Test that an executable has been created by executing

../bin/HLA-LA --action testBinary

... and you should the following message:

HLA*LA binary functional!

Any other message indicates that there is a problem - if you receive errors about shared libraries, modify your LD_LIBRARY_PATH accordingly.

Debugging issues with bamtools and boost

If you receive messages about missing symbols or files from the Boost and/or bamtools libraries, carry out the following steps.

First, make sure that your bamtools and Boost installations exhibit a standard directory structure. Specifically, this means:

For Boost: if you specified /path/to/boost as BOOST_PATH (either via the command line or directly in the makefile), the paths /path/to/boost/include and /path/to/boost/lib should exist on your system. Furthermore, /path/to/boost/lib should contain the required library files (boost_system, boost_filesystem, boost_random, boost_serialization); on a standard Linux, this would typically translate into the presence of lib + name + .a files in this folder (e.g.: for library boost_system, on a standard Linux there should be a libboost_system.a file).
For bamtools: if you specified /path/to/bamtools/ as BAMTOOLS_PATH (either via the command line or directly in the makefile), the paths /path/to/bamtools/include/bamtools, /path/to/bamtools/src and /path/to/bamtools/lib64 should exist on your system. If /path/to/bamtools/lib64 does NOT exist on your system, but /path/to/bamtools/lib does, edit line 11 of the makefile accordingly. Furthermore, the specified bamtools library directory (either /path/to/bamtools/lib64 or /path/to/bamtools/lib) should contain a bamtools library file. On most Linux machines, this is equivalent to there being a libbamtools.a file.

If your directory structure looks good and you have verified the existence of the required library files, but continue to receive error messages about missing files or missing symbols during the linking process, you can try directly including the required files. To do so, for the library file you want to include directly, identify the corresponding -lLIBRARY switch in line 14 of the makefile, and substitute it with the path to the corresponding library file.

For example, if you want to directly include the bamtools library file, remove -lbamtools from line 14, and substitute it with /path/to/bamtools/lib64/libbamtools.a.

Download the data package

Download the data package (http://www.well.ox.ac.uk/downloads/PRG_MHC_GRCh38_withIMGT.tar.gz, 2.3G) and extract it into HLA-PRG/graphs, i.e.

cd HLA-LA/graphs
wget http://www.well.ox.ac.uk/downloads/PRG_MHC_GRCh38_withIMGT.tar.gz
tar -xvzf PRG_MHC_GRCh38_withIMGT.tar.gz

md5sum for PRG_MHC_GRCh38_withIMGT.tar.gz is 525a8aa0c7f357bf29fe2c75ef1d477d.

Modifying paths.ini

HLA*LA makes use of bwa, samtools and picard for various steps of the inference process. It is recommended to manually specify the paths to the right executables in the file HLA-LA/src/paths.ini - the cloned repository will contain an example file, and the format should be self-explanatory.

Note that you can specify multiple alternatives per program, for example for running HLA*LA in a heterogeneous environment (HLA*LA will always use the first alternative present, and also try a which if none of the specified alternatives are present).

In cluster environments, you might also want to modify the workingDir entry in the paths.ini file. workingDir only accepts one value (even if you specify multiple values via a comma-separated list, only the first value will be used), and you can use the string $HLA-LA-DIR to refer to the HLA-LA installation directory. If you delete the workingDir line, users have to pass a working directory via the --workingDir argument.

Index the graph

Finally, pre-compute the graph index structure - this can take a few hours and might take up to 40G of memory:

../bin/HLA-LA --action prepareGraph --PRG_graph_dir ../graphs/PRG_MHC_GRCh38_withIMGT

Testing your installation

Download and index the "NA12878 mini" test CRAM file from https://www.dropbox.com/s/xr99u3vqaimk4vo/NA12878.mini.cram?dl=0 (md5sum 45d1769ffed71418571c9a2414465a12; 316M), rename the file to remove the dl=0 suffix (mv NA12878.mini.cram?dl=0 NA12878.mini.cram), run HLA*LA (see below), and compare the output with https://github.com/DiltheyLab/HLA-LA/blob/master/NA12878_example_output_G.txt.

Commands:

samtools index NA12878.mini.cram
./HLA-LA.pl --BAM NA12878.mini.cram --graph PRG_MHC_GRCh38_withIMGT --sampleID NA12878 --maxThreads 7

All allele calls should agree, and Q should be 1 for all calls.

Running HLA*LA

./HLA-LA.pl --BAM /path/to/indexed.bam --graph PRG_MHC_GRCh38_withIMGT --sampleID $mySampleID --maxThreads 7

A few notes:

All output goes into ../working/$mySampleID (where $mySampleID is a variable). Use a unique sample ID for each sample.
If you want the output to go into a different directory, you can also pass the --workingDir argument. HLA-LA will create a separate sub-directory (the name of which will be equal to --sampleID) in the --workingDir directory.
You can also specify a CRAM file.
Both CRAM and BAM files need to be indexed.
If you experience with CRAM files, try specifying a suitable reference genome file via the command line switch --samtools_T.
Modify --maxThreads 7 according to your needs.
HLA*LA tries to automatically figure out the right reference genome for your BAM. It compares the index of your BAM/CRAM with a database of known references, that contains the regions relevant for HLA typing. Reads from these regions are extracted and processed. We currently have support for various versions of B37 and for the 1000 Genomes GRCh38 reference. If the program complains that it cannot find a compatible entry in its internal database, please get in touch - adding more references is easy (see below), and we want to support as wide a range of popular references as possible!
You do not need to modify the utilized graph depending on whether your BAM uses GRCh37 or GRCh38.

Cluster installation

If you want to create a central installation of HLA-LA, you will probably want to delete the workingDir entry from the paths.ini file -- this forces users to specify a working directory via the command line and avoids shared access to the same working directory.

CRAM files

If you use CRAM input, make sure that your CRAM file contains all of the original sample reads, including the unmapped ones (which are typically enriched for HLA-derived reads). We've sometimes come across CRAM files for which this hasn't been the case; and the resulting HLA calls were not very good (the coverage statistics in the call file are sometimes, but now always, indicative of such problems).

f you work with CRAM files and receive error messages like Unable to fetch reference #5 28687068..28707018, the reason is that samtools view needs to know which reference genome was used to create your CRAM file. Solution: use the switch --samtools_T to specify a suitable reference genome file. Example: ./HLA-LA.pl --BAM NA12878.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram --graph PRG_MHC_GRCh38_withIMGT --sampleID NA12878 --samtools_T /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa.

Long reads

If you want to carry out genotyping from a long-reads BAM, please specify the parameter --longReads ont2d (for Oxford Nanopore reads) or --longReads pacbio (for PacBio reads). Of note, typing will still be carried out at G group resolution. This feature is experimental; accuracy may not match short-read-based typing.

Assembly typing

HLA typing of assemblies is implemented as a separate module. See HLA-ASM.md for details.

Interpreting the output from HLA*PRG:LA

For each sample with ID $mySampleID, the main output file is ../working/$mySampleID/hla/R1_bestguess_G.txt. Columns:

Locus: Locus (HLA-A, HLA-B, etc..).
Chromosome: 1 or 2. Calls are not phased across loci!
Allele: Called allele at G group resolution (exons 2/3 for class I genes, exon 2 for class II genes).
Q1: Quality score (~ probability), between 0 and 1.
Q2: Ignore.
AverageCoverage: Locus average coverage.
CoverageFirstDecile: First decile coverage.
MinimumCoverage: Minimum column coverage for locus.
proportionkMersCovered: Proportion k-mers belonging to the called allele observed in locus input data.
LocusAvgColumnError: Average alignment error for this locus.
NColumns_UnaccountedAllele_fGT0.2: Number of columns with evidence for the presence of alleles not accounted for by the called alleles.
perfectG: Perfect translation from HLA*PRG:LA call to G grop resolution? (1 = Yes).

Checking AverageCoverage for each sample is an important sanity check - AverageCoverage fluctuates between samples and loci, but its expectation depends on average whole-genome coverage. Also, plotting AverageCoverage for all samples and all loci in your cohort is a good idea!

There are three additional quality indicators: Q1, proportionkMersCovered, NColumns_UnaccountedAllele_fGT0.2. Q1 is usually equal to or very close to 1. proportionkMersCovered should always be 1, at least for high-coverage Illumina samples. NColumns_UnaccountedAllele_fGT0.2 is also usually 0, but NColumns_UnaccountedAllele_fGT0.2 != 0 is not a reliable indicator for false calls. Deviations from the "usual" values sometimes indicate the presence of novel alleles. See the original HLA*PRG paper, in which we discussed the predictive value of these indicators in some detail.

If perfectG != 1, you might want to check ../working/$mySampleID/hla/R1_bestguess.txt, which contains the un-translated output from the internal model of the algorithm.

Also, ../working/$mySampleID/reads_per_level.txt gives coverage across the MHC - the coordinate system is that of the PRG itself, i.e. it is not identical to normal genomic coordinates. The file also contains string identifiers for each level, indicating e.g. gene names. We haven't tested the output of this file in any way, but it might be interesting for some applications.

HLA-DRB orthologs

Treat calls for HLA-DRB3/4 with caution. The model does not try to estimate copy number for these genes, so you'll end up with calls for genes that are not present (in interpreting calls for the DRB orthologs, it helps to take into account HLA-DRB1 genotype, which, generally speaking and via linkage, will tell you how many HLA-DRB3/4/5 copies to expect).

Adding further references

Whenever you apply HLA*LA to a BAM file, the algorithm compares the reference genome of the BAM (contig identifiers and contig lengths, using samtools idxstats) to a database of known references. This database tells the algorithm which BAM regions are relevant for HLA typing; it will extract reads from these regions and use them for typing.

If, however, your utilized reference genome is not in the internal database (to re-iterate this point: we're looking for exact matches in terms of contig identifiers and contig lengths!), you will receive an error message.

The easiest thing to do is then to send the output from samtools idxstats to us, and we'll add your reference to the database.

You can also, however, modify the database yourself:

Each graph has its own reference database. Typically your reference directory is HLA-LA/graphs/PRG_MHC_GRCh38_withIMGT/knownReferences.
Each file in this directory represents one reference. The files are tab-separated with header lines and each line shall have the full number of fields (even if they are empty). The required fields are:
- contigID: Name of the contig
- contigLength: Length of the contig
- ExtractCompleteContig: Requires a value, either 0 (no) or 1 (yes)
- PartialExtraction_Start: Mututally incompatible with ExtractCompleteContig = 1 - if partial extraction, start of the extraction (1-based coordinates).
- PartialExtraction_Stop: Stop coordinate for partial extraction (see PartialExtraction_Start)
For the MHC from chromosome 6, we typically use partial extraction with the MHC coordinates.
MHC ALTs and HLA genomic sequences (if any) are usually completely extracted.
Unmapped reads should usually be extracted. Use '*' (without quotes) as contigID, 0 for contigLength, and 1 for ExtractCompleteContig.

If you create your own extraction files, have a look at the existing files first - they will give you a good example to work from.

Important: if you miss regions, the reads aligned to these are lost for the inference process - make sure to catch all alternative MHC contigs and all HLA references!

Citing HLA*LA

Please cite the HLA*LA publication.

hla-la's People

Contributors

Stargazers

Watchers

Forkers

kpalin biocodings hanpingchen xtmgah asazonov xchromosome219 bgistone aakrosh moutsian jlrgraham23 hmyh1202 santy-8128 evanbiederstedt 0kd jafors flywind2 miko-798 jlanej shulp2211 lifebit-ai cgpu toastyjt awilcz xiaobo199405 jingquanlim rnaimehaom shicheng-guo p15 thekingofall bsonjia irunonayran libingnan11 yinxx jonaszierer tzhang-nmdp atvicenti suraj-adewale suhrig ningshuang-yao eneritz a1aks

hla-la's Issues

only for GRCh38? not Hg19?

I found the graph is named PRG_MHC_GRCh38_withIMGT , but my REF is hg19 ? can i use this soft to anlysis HLA ?

Slow tests

Hello, thanks for great tool.
We're trying to implement it in our studies.

For now tests, described here, working well and our test results are very similar to your GitHub result. But it takes approximately 40 minutes to pass (on NA12878.mini.cram using server with 32 threads and 128 GB RAM) - near 10 minutes for remapping, near 4 minutes for HLA-typing and 25 minutes for intermediate step (I don't fully understand it).

Is it normal? Other HLA-typers we are using (for example xHLA) take a few (2-3) minutes on full WGS sample for typing - that's why I'm asking here. I understand remapping time, but not intermediate step.

Here is our result Test.R1_bestguess_G.txt if it can help to debug.

For some reason there are threads: 1 in all 136 blocks - looks like here is a code line controlling it.

And one more question: according to this line there are some functionality to use only ClassI and 3 ClassII HLA. How can we use this option from HLA-LA.pl run command?

Assertion `sequencesStream.is_open()' failed.

Hi,

I am trying to run HLA-PRG-LA with the test.cram and got Assertion `sequencesStream.is_open()' failed.
I was wondering if any one else have encountered such an issue and knows how it can be fixed?

Thanks,
Wendy

[ Fri Sep 22 15:17:17 2017 ] Remapping done.
HLA-PRG-LA: mapper/processBAM.cpp:1173: void mapper::processBAM::initBAM(std::string, bool, bool): Assertion `sequencesStream.is_open()' failed.
HLA-PRG-LA execution not successful. Command was ../bin/HLA-PRG-LA --action HLA --maxThreads 4 --sampleID test --outputDirectory /HLA.Typing/HLA_PRG_LA/test/test --PRG_graph_dir /hla-prg-la/src/../graphs/PRG_MHC_GRCh38_withIMGT --FASTQ1 /HLA.Typing/HLA_PRG_LA/test/test/R_1.fastq --FASTQ2 /HLA.Typing/HLA_PRG_LA/test/test/R_2.fastq --bwa_bin /bin/bwa --samtools_bin /SAMtools/1.3/bin/samtools --mapAgainstCompleteGenome 0

Sample names with strange characters

I am using HLA-LA within the 1,000 genomes project environment and unfortunately sample names within the BAMs are for example LP000000-DNA_A01 and so your program gives an error. I can reheader using samtools to LP00000DNAA01 and then the program works fine (but in order to do this I have to copy the bam into a writeable file by me first...). As I am trying to run this on thousands of bams, this is becoming very time consuming and memory gobbling! Is there anything I can do to modify the code to allow - and/or _ into sample names?

Any help would be very appreciated!

Helen

update makefile for current bamtools, boost

Hello,

Thank you for providing this excellent software. Would you consider updating your makefile to work with the current bamtools repo and boost libraries in $HOME, to ease installation for new users?

See bamtools.

BOOST_INCLUDE = {$HOME}/boost/boost_1_64_0/install/include
BOOST_LIB = {$HOME}/boost/boost_1_64_0/install/lib
BAMTOOLS_INCLUDE = {$HOME}/bamtools/include/api

BAMTOOLS_SRC = {$HOME}/bamtools/src
BAMTOOLS_LIB = {$HOME}/bamtools/lib

INCS = -I$(BOOST_INCLUDE) -I$(BAMTOOLS_INCLUDE) -I$(BAMTOOLS_SRC)
LIBS = -L$(BOOST_LIB) -L$(BAMTOOLS_LIB) -lboost_random -lboost_filesystem -lboost_system  -lbamtools -lbamtools-utils -lz -lboost_serialization

and

export LD_LIBRARY_PATH="{$HOME}/bamtools/lib:$LD_LIBRARY_PATH"

Let me know if you would like a PR to update the makefile and README. Thanks!

Andrew

indexing - error

Hi
I did the indexing with the following command
../bin/HLA-PRG-LA --action prepareGraph --PRG_graph_dir ../graphs/PRG_MHC_GRCh38_withIMGT

It began, but then was killed: 31800000Killed
Please, help, what shall I do?

Issue with bowtie2

Hello!

Excuse me for possible English mistakes. I've got an issue trying to make HLA typing with HLA-LA on NGS samples, which were aligned to hg38 reference using bowtie2, and HLA-LA gives me following mistake

Have found no compatible reference specifications in somepathonserver/HLA-PRG-LA/src/../graphs/PRG_MHC_GRCh38_withIMGT/knownReferences - create a file for this BAM file and try again. at somepathonserver/HLA-PRG-LA/src/HLA-PRG-LA.pl line 267.

And if I perform HLA typing with HLA-LA on the same samples, which were aligned to the same reference using BWA and it always works correct. Is there any known solution for this problem? I changed chromosome naming from 1 2 3 etc which is used by bowtie2 to chr1 chr2 chr3 used by BWA with samtools reheader but that didn't work out.

I cannot index the graph

So I am trying to set up HLA-LA to work on a ubuntu 18.04 docker machine. I have followed you guide and managed to receive a message "HLA*LA binary functional!".
Now I've downloaded the graph and tried to index it but it always fails at the same point. I though it could be memory related problem so I've pushed memory limit of my image to 64G but it still fails at the same spot. Now please note that I have ignored the "Modifying paths.ini" step until now, but I don't see any connection between these two steps.
Can you think of the reason why my indexing fails?
Thank you!

How to get help?

How to get help from the program? I have tried:

root@5f5e7a4b6c25:/outputs/PRG_MHC_GRCh38_withIMGT# /soft/HLA-PRG-LA/bin/HLA-PRG-LA --help
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check
Aborted (core dumped)
root@5f5e7a4b6c25:/outputs/PRG_MHC_GRCh38_withIMGT# /soft/HLA-PRG-LA/bin/HLA-PRG-LA -h
terminate called after throwing an instance of 'std::runtime_error'
  what():  Please specify arguments --bwa_bin
Aborted (core dumped)
root@5f5e7a4b6c25:/outputs/PRG_MHC_GRCh38_withIMGT# /soft/HLA-PRG-LA/bin/HLA-PRG-LA help
terminate called after throwing an instance of 'std::runtime_error'
  what():  Please specify arguments --bwa_bin
Aborted (core dumped)
root@5f5e7a4b6c25:/outputs/PRG_MHC_GRCh38_withIMGT# /soft/HLA-PRG-LA/bin/HLA-PRG-LA
terminate called after throwing an instance of 'std::runtime_error'
  what():  Please specify arguments --bwa_bin
Aborted (core dumped)

What is wrong?

Wrong version of samtools

root@5f5e7a4b6c25:/outputs/soft/HLA-PRG-LA# samtools --version
samtools 1.3.1
Using htslib 1.3.1
Copyright (C) 2016 Genome Research Ltd.

./bin/HLA-PRG-LA --bwa_bin $BWA --samtools_bin /soft/samtools/bin/samtools --BAM /outputs/NA12878.cram --graph ../graphs/PRG_MHC_GRCh38_withIMGT --sampleID NA12878_serge --maxThreads 8 --workingDir ./
....
/home/dilthey/bowtie2-2.2.8/bowtie2 -a -L 25 --omit-sec-seq -x tmp/simulatedGraphs/G1/bowtie2IDX -U tmp/simulatedGraphs/G1/FASTQ/R_1.fq -S tmp/simulatedGraphs/G1/BAM/output_1.bam.sam
1158 reads; of these:
  1158 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    29 (2.50%) aligned exactly 1 time
    1129 (97.50%) aligned >1 times
100.00% overall alignment rate
/soft/samtools/bin/samtools view -bS tmp/simulatedGraphs/G1/BAM/output_1.bam.sam > tmp/simulatedGraphs/G1/BAM/output_1.bam.unsorted
/soft/samtools/bin/samtools sort tmp/simulatedGraphs/G1/BAM/output_1.bam.unsorted tmp/simulatedGraphs/G1/BAM/output_1
[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
  -@, --threads INT
             Set number of sorting and compression threads [1]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
terminate called after throwing an instance of 'std::runtime_error'
  what():  Command /soft/samtools/bin/samtools sort tmp/simulatedGraphs/G1/BAM/output_1.bam.unsorted tmp/simulatedGraphs/G1/BAM/output_1 returned code 256
Aborted (core dumped)

Typing of HLA-DP Locus

Hi there,

Thanks for this tool. Is it possible to use HLA*LA to type the HLA-DP locus? When I was reading the paper (https://academic.oup.com/bioinformatics/article/35/21/4394/5426702), I didn't see any mention of the HLA-DP locus in any of the results/benchmarking.

Kind regards,

Fong

Can't start the test

I have indexed the graph and now trying to run the test. But I can't:

root@5f5e7a4b6c25:/outputs/soft/HLA-PRG-LA/bin# ./HLA-PRG-LA --bwa_bin $BWA --samtools_bin $SAMTOOLS --BAM /outputs
/NA12878.cram --graph ../graphs/PRG_MHC_GRCh38_withIMGT --sampleID NA12878_serge --maxThreads 8 --workingDir ./
HLA-PRG-LA: HLA-PRG-LA.cpp:1350: int main(int, char**): Assertion `Utilities::directoryExists(allGraphsOutputDir)' failed.
Aborted (core dumped)

Performance issue

Hello, I am part of Bioinformatics team in Seven Bridges Genomics and we are interested in porting HLA-LA algorithm to our platform. We would appreciate your assistance with some of the issues we encountered.

So far, we have created proper environment via Docker and I successfully ran the algorithm on the provided test .cram file. Nevertheless we are concerned about execution time.
The algorithm was ran on a AWS machine: r4.4xlarge with 16 CPU and 122 GB RAM
Execution lasted for 2 hours, and we've set the maxThreads parameter to 16.
While looking into execution details we found out that our AWS instance used only 1 CPU.

Would you be so kind and help us understand what could be the reason of this multithreading failure? Is there some module or package we are missing that could help us achieve the full performance?

We are providing you a log file that came out of our AWS instance.
Thank you in advance.

job.err.log

Allowing for non-alphanumeric characters in sample IDs

Hi there,

Is it possible to allow for the --sampleID argument to accept non-alphanumeric characters? I often work with identifiers that contain a - in it and when I try to run for example something like this:

HLA-LA.pl \
        --BAM tumour-sample.bam \
        --graph PRG_MHC_GRCh38_withIMGT \
        --sampleID tumour-sample \
        --maxThreads 1 \
        --workingDir output/hla_la

I get this error:

Please use only alphanumeric characters - \w+ - for --sampleID at /projects/ace_benchmarking/miniconda3/envs/hla-la/bin/HLA-LA.pl line 147.

I could specify a --sampleID devoid of any non-alphanumeric characters, but this adds post-processing step to marry the results with the original sample identifiers, which I would like to avoid.

Kind regards,

Fong

A problem when test pacbio data

When I use pacbio data to test the typing effect of HLA-LA, I can't get the correct answer.
I first searched on https://www.ebi.ac.uk using the registration number PRJNA323611, then selected the Experiment accession which is SRX1837275 to download the data, and the five data files obtained are integrated using the cat command:

I receive the file "pacbio_test.fastq.gz".
Then I use the command "bwa mem -t 7 -M -x pacbio /mnt/data/data/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa pacbio_test.fastq.gz |samtools view -bS | samtools sort -o pacbio_test_sort.bam -" to process the data and get the sorted bam file. And I index the bam file.
Next, I use the command "./HLA-LA.pl --BAM /mnt/data/data/pacbio/pacbio_test_sort.bam --graph PRG_MHC_GRCh38_withIMGT --sampleID pacbio_test --maxThreads 7 --longReads pacbio --samtools_T /mnt/data/data/NA12878_final/GRCh38_full_analysis_set_plus_decoy_hla.fa" to run the HLA-LA, I can only get the result like this:

The file "hla" is empty.
I can't get the right typing result. Would you be so kind and help me to solve this problem? The log of command execution is attached to the attached file.

if I have fastq paired read data, what should i do?

I noticed that your code extract bam to fastq first, and mapping with your reference.
if I have fastq already, Should i skip extract bam file and use my fastq as input to your program? How?

A problem about reference?

I downloaded the reference GRCh38 from Ensemble and build the index with bowtie2, but the problem happened like below, anyone can give me some advice?
Have found no compatible reference specifications in /public/home/fanlj/mlf/HLA/HLA-LA/src/../graphs/PRG_MHC_GRCh38_withIMGT/knownReferences - create a file for this BAM file and try again. at ../../HLA-LA/src/HLA-LA.pl line 315

errors running test

Hi,

I am running the smaller test file provided with this command: ../src/HLA-PRG-LA.pl --BAM NA12878.mini.cram --graph ../graphs/PRG_MHC_GRCh38_withIMGT --sampleID NA12878 --maxThreads 7

It returns this error message:
HLA-PRG-LA: mapper/bwa/BWAmapper.cpp:126: void mapper::bwa::BWAmapper::mapLong(std::string, std::string, std::string, bool, std::string): Assertion `(longMode == "pacbio") || (longMode == "ont2d")' failed.

I don't have any longMode input and it should not go to mapLong function. Is there any issues with my command or installation ?

Thanks,
Sarah

User-defined path to graph directory

Hi @AlexanderDilthey,
first of all, thanks for your work, HLA-LA works like a charm, but still there is one thing I'd like to suggest:
I use HLA-LA in various projects and have integrated it in a pipeline mostly based on snakemake and conda. So, whenever I use the pipeline on a new analysis, the conda environments are re-built in the specific project folder. Since HLA-LA fixes the directory of the PRGs to the specific location of installation, I would need to download and index the graph for every project even if I wanted to use the same reference in all of them.

I think that the ability of a user-defined graph directory which allows to use already downloaded and indexed graphs would increase the reusability in this scenario and might be a helpful feature for some users.

I already worked a bit on a possible solution, so if you think this could be a helpful addition to HLA-LA, I would be happy to discuss it more deeply and make a PR.

Thanks in advance!
Best,
Jan

Test CRAM files now missing

Hi
Just to flag up the NA12878 CRAM files are gone from the link in README.MD:

https://gembox.cbcb.umd.edu/shared/NA12878.cram

It says the server has been retired??

Andrew

prepate graph

I downloaded the docker file and incorporated the commands for downloading the tar file, untar and prepare the index
I got the following error
prepareGraph
[ Wed Oct 14 16:16:25 2020 ] Read graph from /usr/local/bin/HLA-LA/graphs/PRG_MHC_GRCh38_withIMGT
30400000Killed
The command '/bin/sh -c /usr/local/bin/HLA-LA/bin/HLA-LA --action prepareGraph --PRG_graph_dir /usr/local/bin/HLA-LA/graphs/PRG_MHC_GRCh38_withIMGT' returned a non-zero code: 137

Improvements to Documentation

In the section titled Basics is a link to a blog post which no longer exists. IMGT is regularly updated. The overview should specify the version of the database used to create the data package (e.g. 3.34.0) to enhance research results reproducibility and identifability. The reference is to the pre-print, but it should instead be to the journal article in Bioinformatics.

execution not successful

Hi! I installed via conda, added a genome file via idxstats (for hg19 with "chr" in the chromosome names, pulling reads from the MHC region on chr6, the chr6 haplotype contigs, and unmapped reads), and started jobs for a couple of whole exome BAM files. One's still running, but two have stopped with a mysterious error.

STDOUT:

HLA-LA.pl

Identified paths:
        samtools_bin: /home/ubuntu/HLA.LA/miniconda3/bin/samtools
        bwa_bin: /home/ubuntu/HLA.LA/miniconda3/bin/bwa
        java_bin: /home/ubuntu/HLA.LA/miniconda3/bin/java
        picard_sam2fastq_bin: /home/ubuntu/HLA.LA/miniconda3/bin/picard
        General working directory: /home/ubuntu/HLA.LA
        Sample-specific working directory: /home/ubuntu/HLA.LA/primary

Extract reads from 9 regions...
Extract unmapped reads...
Merging...
Indexing...
Extract FASTQ...
        /home/ubuntu/HLA.LA/miniconda3/bin/picard SamToFastq VALIDATION_STRINGENCY=LENIENT I=/home/ubuntu/HLA.LA/primary/extraction.bam F=/home/ubuntu/HLA.LA/primary/R_1.fastq F2=/home/ubuntu/HLA.LA/primary/R_2.fastq FU=/home/ubuntu/HLA.LA/primary/R_U.fastq 2>&1

Now executing:
../bin/HLA-LA --action HLA --maxThreads 5 --sampleID primary --outputDirectory /home/ubuntu/HLA.LA/primary --PRG_graph_dir /home/ubuntu/HLA.LA/miniconda3/opt/hla-la/src/../graphs/../graphs/PRG_MHC_GRCh38_withIMGT --FASTQU /home/ubuntu/HLA.LA/primary/R_U.fastq.splitLongReads --FASTQ1 /home/ubuntu/HLA.LA/primary/R_1.fastq --FASTQ2 /home/ubuntu/HLA.LA/primary/R_2.fastq --bwa_bin /home/ubuntu/HLA.LA/miniconda3/bin/bwa --samtools_bin /home/ubuntu/HLA.LA/miniconda3/bin/samtools --mapAgainstCompleteGenome 1 --longReads 0
Set maxThreads to 5
 [ Fri May 17 21:34:02 2019 ] Graph serialization existing and newer than graph file; read from /home/ubuntu/HLA.LA/miniconda3/opt/hla-la/src/../graphs/../graphs/PRG_MHC_GRCh38_withIMGT/serializedGRAPH
 [ Fri May 17 21:39:27 2019 ]   done.

STDERR:

[bam_translate] RG tag "20171025_1728510947" on read "K00277:41:HMCGGBBXX:1:1101:10318:13042" encountered with no corresponding entry in header, tag lost. Unknown tags are only reported once per input file for each tag ID.
[bam_translate] PG tag "MarkDuplicates" on read "K00277:41:HMCGGBBXX:1:1101:10318:13042" encountered with no corresponding entry in header, tag lost. Unknown tags are only reported once per input file for each tag ID.
 [ Fri May 17 21:39:58 2019 ] processBAM::processBAM(..): Start graph gap analysis.
 [ Fri May 17 21:40:01 2019 ] processBAM::processBAM(..) graph gap analysis: have 3494 graph gap stretches; criterion length >= 3
/home/ubuntu/HLA.LA/miniconda3/bin/bwa mem -t5 -M -a /home/ubuntu/HLA.LA/miniconda3/opt/hla-la/src/../graphs/../graphs/PRG_MHC_GRCh38_withIMGT/extendedReferenceGenome/extendedReferenceGenome.fa /home/ubuntu/HLA.LA/primary/R_1.fastq /home/ubuntu/HLA.LA/primary/R_2.fastq | /home/ubuntu/HLA.LA/miniconda3/bin/samtools view -Sb - > /home/ubuntu/HLA.LA/primary/remapped_with_a.bam.unsorted
terminate called after throwing an instance of 'std::runtime_error'
  what():  Command /home/ubuntu/HLA.LA/miniconda3/bin/bwa mem -t5 -M -a /home/ubuntu/HLA.LA/miniconda3/opt/hla-la/src/../graphs/../graphs/PRG_MHC_GRCh38_withIMGT/extendedReferenceGenome/extendedReferenceGenome.fa /home/ubuntu/HLA.LA/primary/R_1.fastq /home/ubuntu/HLA.LA/primary/R_2.fastq | /home/ubuntu/HLA.LA/miniconda3/bin/samtools view -Sb - > /home/ubuntu/HLA.LA/primary/remapped_with_a.bam.unsorted returned code -1
HLA-LA execution not successful. Command was ../bin/HLA-LA --action HLA --maxThreads 5 --sampleID primary --outputDirectory /home/ubuntu/HLA.LA/primary --PRG_graph_dir /home/ubuntu/HLA.LA/miniconda3/opt/hla-la/src/../graphs/../graphs/PRG_MHC_GRCh38_withIMGT --FASTQU /home/ubuntu/HLA.LA/primary/R_U.fastq.splitLongReads --FASTQ1 /home/ubuntu/HLA.LA/primary/R_1.fastq --FASTQ2 /home/ubuntu/HLA.LA/primary/R_2.fastq --bwa_bin /home/ubuntu/HLA.LA/miniconda3/bin/bwa --samtools_bin /home/ubuntu/HLA.LA/miniconda3/bin/samtools --mapAgainstCompleteGenome 1 --longReads 0

Any idea what this is about?

Thanks,
~Joe

HLA-PRG-LA output on the Exome samples

Hello,

I was just wondering, if there is a way to make the HLA-PRG-LA output generated on exomes for the pre-print https://www.biorxiv.org/content/early/2018/10/26/453555 available.

I am currently testing it out, and got a lower accuracy% for HLA-A and HLA-C from the same 29 samples. So wanted to compare with the results from the pre-print and check which samples are discordant.

Thank you.

Very slow process (5 hours)

Hello,

I'm using HLA-LA and everything seems to be fine (so far, we processed 50 exomes). We run tests with NA12878, NA12155, NA19128, NA11892, NA19127, NA19700, NA12400 (samples sequenced here with Illumina) and predictions considering 4 digits were good (the HLA-DRB3/4 were the most problematic as described in the README of this repo).

The problem is: there is one sample that does not finish the process... when I inspect its stdout I found that it took ~5 hours for one step, as pasted below:

processBAM::alignReads_postSeedExtraction_andStoreInto(): Deal 10000 total read pairs with seeds, of which 1676 are incomplete.
 [ Sun Mar  8 00:36:27 2020 ] Read pair 0 of 8324
 [ Sun Mar  8 00:37:22 2020 ] .. done. Processed  112339 in total (total read IDs: 516844.
 [ Sun Mar  8 00:37:22 2020 ] Process 14/51
 [ Sun Mar  8 00:37:22 2020 ] 	Start seed extraction
 [ Sun Mar  8 00:37:36 2020 ] 			Done extractSeeds2
processBAM::extractSeeds2(): examined 105480 reads, transformed into 10000 seeds.
 [ Sun Mar  8 00:37:36 2020 ] 	Alignment
 [ Sun Mar  8 00:37:36 2020 ] Proto-seed statistics:
	Incomplete: 1679
	Complete: 8321
		Average chains per read: 5.44021
		Average chain length: 98.9056
		Average primary chain length: 99.1694

processBAM::alignReads_postSeedExtraction_andStoreInto(): Deal 10000 total read pairs with seeds, of which 1679 are incomplete.
 [ Sun Mar  8 00:37:36 2020 ] Read pair 0 of 8321
 [ Sun Mar  8 05:41:04 2020 ] .. done. Processed  120660 in total (total read IDs: 516844.
 [ Sun Mar  8 05:41:04 2020 ] Process 15/51
 [ Sun Mar  8 05:41:04 2020 ] 	Start seed extraction
 [ Sun Mar  8 05:41:18 2020 ] 			Done extractSeeds2
processBAM::extractSeeds2(): examined 102128 reads, transformed into 10000 seeds.
 [ Sun Mar  8 05:41:18 2020 ] 	Alignment
 [ Sun Mar  8 05:41:18 2020 ] Proto-seed statistics:
	Incomplete: 1722
	Complete: 8278
		Average chains per read: 5.25562
		Average chain length: 99.2039
		Average primary chain length: 99.1174

had anybody seen this problem already?

Here is the whole stdout file. I killed the process because it was not going forward.

stdout.txt

Software versions:

hla_la_version 1.0.1
samtools_version 1.9
bwa_version 0.7.17
picard_version 1.123
bamtools_version 2.5.1

Edit: the analysis took 22 hours to successfully end.

Discordant results from the same person.

Hi.

I've got discordant results from different tissues of the same person using HLA-LA.
The results are all the same except in DPA1 region: one is DPA102:02:02 and the other is DPA101:11.
Though average coverage is lowest in DPA1 region than other regions, it's above 30X.

I attached the result files.
AA1_P3_R1_bestguess_G.txt
AA1_BM_R1_bestguess_G.txt

I investigated the intermediate files and found that there are summary stats prioritizing likely pairs in "R1_pileup_DPA1.txt" file.

ClusterID P LL Mismatches_avg
DPA101:03:01:01;DPA101:03:01:02;DPA101:03:01:03;DPA101:03:01:04;DPA101:03:01:05/DPA102:02:02 1 -97.8373 143
DPA101:03:01:01;DPA101:03:01:02;DPA101:03:01:03;DPA101:03:01:04;DPA101:03:01:05/DPA102:05 1.88852e-21 -145.556 151
DPA101:03:01:01;DPA101:03:01:02;DPA101:03:01:03;DPA101:03:01:04;DPA101:03:01:05/DPA102:02:06 4.64625e-24 -151.563 151.5
DPA101:03:01:01;DPA101:03:01:02;DPA101:03:01:03;DPA101:03:01:04;DPA101:03:01:05/DPA102:04 4.61435e-25 -153.873 152
DPA101:03:04/DPA102:02:02 1.73353e-32 -170.97 155.5
DPA101:12/DPA102:02:02 7.28223e-38 -183.35 153
DPA101:11/DPA102:02:02 1.9045e-42 -193.902 141
DPA101:10/DPA102:02:02 4.47384e-52 -216.073 159
DPA101:03:04/DPA102:05 3.27271e-53 -218.689 163.5
DPA101:03:04/DPA102:02:06 8.0459e-56 -224.697 164
DPA101:03:04/DPA102:04 7.99056e-57 -227.006 164.5
......

I also attached "R1_pileup_DPA1.txt" files.
AA1_BM_R1_PP_DPA1_pairs.txt
AA1_P3_R1_PP_DPA1_pairs.txt

Can you explain the meaning of the column names (ClusterID, P, LL, and Mismatches_avg) and possible answers for the discordant results?

Thank you!

Is this tool able to type MICA

Hi, can this software be used to type MICA alleles?

bamtools installation problem

Hi I got an error in the mapper/processBAM.cpp:4341 while I was compiling. This is an older version of bamtools, so I did follow the instructions to comment in the include statement that was previously commented out. But the error still persists... Any help is appreciated! Thank you!

~/software/HLA-LA/src/mapper/processBAM.cpp:4341: undefined reference to BamTools::BamReader::GetReferenceID(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const' ../obj/processBAM.o: In function mapper::processBAM::updateBAMSelectors()':

segmentation fault

I'm getting a segmentation error when I ran HLA-LA at the --action HLA step. The stdout just says run was unsuccessful. If I run the part ./bin/HLA-LA --action HLA... specifically, I get that it failed, it just outputs Segmentation fault. This is with my own bam file. I cannot get the example NA12878.mini.cram to work, as it couldn't even get past the picard step, there I got the SAM validation error bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned.

Tool versions
HLA-LA 1.0.1
samtools 1.10
picard 1.123
bwa 0.7.17
bamtools 2.5.1
boost 1.59.0

Log

Identified paths:
	samtools_bin: /opt/samtools/bin/samtools
	bwa_bin: /opt/bwa
	java_bin: /usr/bin/java
	picard_sam2fastq_bin: /opt/picard_tools/SamToFastq.jar
	General working directory: /hla-la
	Sample-specific working directory: /hla-la/test

Extract reads from 2 regions...
Extract unmapped reads...
Merging...
[bam_translate] RG tag "1" on read "A00609:45:HW72GDSXX:3:1101:28800:29105" encountered with no corresponding entry in header, tag lost. Unknown tags are only reported once per input file for each tag ID.
[bam_translate] PG tag "MarkDuplicates" on read "A00609:45:HW72GDSXX:3:1101:28800:29105" encountered with no corresponding entry in header, tag lost. Unknown tags are only reported once per input file for each tag ID.
Indexing...
Extract FASTQ...
	/usr/bin/java -Xmx10g -XX:-UseGCOverheadLimit -jar /opt/picard_tools/SamToFastq.jar VALIDATION_STRINGENCY=LENIENT I=/hla-la/test/extraction.bam F=/hla-la/test/R_1.fastq F2=/hla-la/test/R_2.fastq FU=/hla-la/test/R_U.fastq 2>&1

Now executing:
../bin/HLA-LA --action HLA --maxThreads 1 --sampleID test --outputDirectory /hla-la/test --PRG_graph_dir /opt/HLA-LA/src/../graphs/PRG_MHC_GRCh38_withIMGT --FASTQU /hla-la/test/R_U.fastq.splitLongReads --FASTQ1 /hla-la/test/R_1.fastq --FASTQ2 /hla-la/test/R_2.fastq --bwa_bin /opt/bwa --samtools_bin /opt/samtools/bin/samtools --mapAgainstCompleteGenome 1 --longReads 0
Set maxThreads to 1
HLA-LA execution not successful. Command was ../bin/HLA-LA --action HLA --maxThreads 1 --sampleID test --outputDirectory /hla-la/test --PRG_graph_dir /opt/HLA-LA/src/../graphs/PRG_MHC_GRCh38_withIMGT --FASTQU /hla-la/test/R_U.fastq.splitLongReads --FASTQ1 /hla-la/test/R_1.fastq --FASTQ2 /hla-la/test/R_2.fastq --bwa_bin /opt/bwa --samtools_bin /opt/samtools/bin/samtools --mapAgainstCompleteGenome 1 --longReads 0

Version of picard

There is one more problem:
https://github.com/broadinstitute/picard/releases/tag/2.9.4
Picard now has one single jar-file! And I can't provide a path to picard_sam2fastq_bin:

root@5f5e7a4b6c25:/outputs/soft/HLA-PRG-LA# ./src/HLA-PRG-LA.pl --picard_sam2fastq_bin "/soft/picard/picard-2.9.0-all.jar SamToFastq"
Command-line supplied value/file for parameter picard_sam2fastq_bin not existing at ./src/HLA-PRG-LA.pl line 368.

The last version of picard with separate JAR-files is picard-tools is 1.123:
https://github.com/broadinstitute/picard/releases/tag/1.123

support for non-paired short reads?

When I run with unpaired reads, I get "You didn't activate --longReads, but the two files ... (which store paired-end reads) are empty - this is weird, and I will abort".

Is there support for non-paired short reads? If not, is this a fundamental issue, or could it be added as an enhancement? I assume just adding the --longReads option isn't a good idea.

bowtie2 is needed but not listed

root@5f5e7a4b6c25:/outputs/soft/HLA-PRG-LA# ./bin/HLA-PRG-LA --bwa_bin $BWA --samtools_bin /soft/samtools/bin/samtools --BAM /outputs/NA12878.cram --graph ../graphs/PRG_MHC_GRCh38_withIMGT --sampleID NA12878_serge --maxThreads 8 --workingDir ./
....
Directory: tmp/simulatedGraphs/G1
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.02 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.01 sec
[main] Version: 0.7.13-r1126
[main] CMD: /soft/bwa.kit/bwa index tmp/simulatedGraphs/G1/referenceGenome/ref.fa
[main] Real time: 0.047 sec; CPU: 0.032 sec
terminate called after throwing an instance of 'std::runtime_error'
  what():  Utilities::findFileFromAlternatives(): No alternative present from C:/Users/AlexanderDilthey/bowtie2-2.2.8, /home/dilthey/bowtie2-2.2.8
Aborted (core dumped)

But bowtie2 is not listed in Prerequisites! And I can't set it's path through an argument! And it is not used from PATH!

samtools idxstats bam header not right

Hi,
i want to use HLA_LA on nanopore longreads bam file.and i use samtools idxstats on my bam and change the header as existing files. but i still get this bug:
Incorrect header for /bio/HLA-LA/HLA-LA/../graphs/PRG_MHC_GRCh38_withIMGT/knownReferences/XH_nanopore_longreads.txt - expect PartialExtraction_Stop, got PartialExtraction_Stop

and i check for many times ,the header shoule be right.
so can you help me. maybe it is a silly problem.
XH_nanopore_longreads.txt
XH.idxstats.txt

Thanks and best,
masen

No compatible reference

Hello

I'm having issues with "no compatible reference" as I think the reference my CRAM files are aligned to is slightly non standard. I wonder if you could generate an additional known reference file from my samtools index?

Many thanks
Seth
idxstats.txt

Data package website not found

Hi Alex,

I was trying to download the data package for HLA-PRG-LA http://www.well.ox.ac.uk/PRG_MHC_GRCh38_withIMGT.tar.gz and unfortunately got this message:

Resolving www.well.ox.ac.uk (www.well.ox.ac.uk)... 129.67.44.50
Connecting to www.well.ox.ac.uk (www.well.ox.ac.uk)|129.67.44.50|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-12-07 01:01:03 ERROR 404: Not Found.

It seems that the link to the data package is no longer available. (The same issue happens to the link of HLA-PRG data package http://www.well.ox.ac.uk/HLA-PRG.tar.gz )
Could you please update the data package link?

add additional reference for Ensembl human assembly 38

Hi, I would like to add this reference. idxstats shows the following:

1 248956422 5335060 0
10 133797422 2096960 0
11 135086622 3007088 0
12 133275309 2220636 0
13 114364328 1559146 0
14 107043718 3120738 0
15 101991189 1346306 0
16 90338345 1713268 0
17 83257441 2357610 0
18 80373285 618100 0
19 58617616 1441954 0
2 242193529 3693058 0
20 64444167 1508904 0
21 46709983 22438086 0
22 50818468 703842 0
3 198295559 3085302 0
4 190214555 1500350 0
5 181538259 2348104 0
6 170805979 3106762 0
7 159345973 2917186 0
8 145138636 2210556 0
9 138394717 2446060 0
MT 16569 147374 0
X 156040895 1218960 0
Y 57227415 37718 0
KI270728.1 1872759 1868 0
KI270727.1 448248 514 0
KI270442.1 392061 232 0
KI270729.1 280839 170 0
GL000225.1 211173 228 0
KI270743.1 210658 322 0
GL000008.2 209709 656 0
GL000009.2 201709 1044 0
KI270747.1 198735 254 0
KI270722.1 194050 82 0
GL000194.1 191469 1550 0
KI270742.1 186739 1588 0
GL000205.2 185591 2370 0
GL000195.1 182896 9600 0
KI270736.1 181920 44 0
KI270733.1 179772 9788058 0
GL000224.1 179693 1092 0
GL000219.1 179198 1092 0
KI270719.1 176845 2440 0
GL000216.2 176608 206 0
KI270712.1 176043 1692 0
KI270706.1 175055 1674 0
KI270725.1 172810 316 0
KI270744.1 168472 794 0
KI270734.1 165050 434 0
GL000213.1 164239 260 0
GL000220.1 161802 9727154 0
KI270715.1 161471 22 0
GL000218.1 161147 3756 0
KI270749.1 158759 206 0
KI270741.1 157432 48 0
GL000221.1 155397 1372 0
KI270716.1 153799 32 0
KI270731.1 150754 414 0
KI270751.1 150742 244 0
KI270750.1 148850 30 0
KI270519.1 138126 18 0
GL000214.1 137718 254 0
KI270708.1 127682 76 0
KI270730.1 112551 12 0
KI270438.1 112505 78 0
KI270737.1 103838 42 0
KI270721.1 100316 1604 0
KI270738.1 99375 70 0
KI270748.1 93321 624 0
KI270435.1 92983 34 0
GL000208.1 92689 30 0
KI270538.1 91309 4 0
KI270756.1 79590 6 0
KI270739.1 73985 4 0
KI270757.1 71251 0 0
KI270709.1 66860 34 0
KI270746.1 66486 154 0
KI270753.1 62944 22 0
KI270589.1 44474 2 0
KI270726.1 43739 24 0
KI270735.1 42811 32 0
KI270711.1 42210 5314 0
KI270745.1 41891 2244 0
KI270714.1 41717 516 0
KI270732.1 41543 20 0
KI270713.1 40745 185894 0
KI270754.1 40191 4 0
KI270710.1 40176 0 0
KI270717.1 40062 1294 0
KI270724.1 39555 6 0
KI270720.1 39050 116 0
KI270723.1 38115 14 0
KI270718.1 38054 966 0
KI270317.1 37690 0 0
KI270740.1 37240 0 0
KI270755.1 36723 54 0
KI270707.1 32032 104 0
KI270579.1 31033 0 0
KI270752.1 27745 2 0
KI270512.1 22689 0 0
KI270322.1 21476 0 0
GL000226.1 15008 0 0
KI270311.1 12399 0 0
KI270366.1 8320 0 0
KI270511.1 8127 0 0
KI270448.1 7992 0 0
KI270521.1 7642 0 0
KI270581.1 7046 0 0
KI270582.1 6504 4 0
KI270515.1 6361 0 0
KI270588.1 6158 0 0
KI270591.1 5796 0 0
KI270522.1 5674 0 0
KI270507.1 5353 0 0
KI270590.1 4685 0 0
KI270584.1 4513 0 0
KI270320.1 4416 0 0
KI270382.1 4215 2 0
KI270468.1 4055 0 0
KI270467.1 3920 16 0
KI270362.1 3530 0 0
KI270517.1 3253 0 0
KI270593.1 3041 0 0
KI270528.1 2983 0 0
KI270587.1 2969 0 0
KI270364.1 2855 0 0
KI270371.1 2805 0 0
KI270333.1 2699 0 0
KI270374.1 2656 0 0
KI270411.1 2646 0 0
KI270414.1 2489 0 0
KI270510.1 2415 0 0
KI270390.1 2387 0 0
KI270375.1 2378 0 0
KI270420.1 2321 0 0
KI270509.1 2318 0 0
KI270315.1 2276 0 0
KI270302.1 2274 4 0
KI270518.1 2186 0 0
KI270530.1 2168 0 0
KI270304.1 2165 0 0
KI270418.1 2145 0 0
KI270424.1 2140 0 0
KI270417.1 2043 0 0
KI270508.1 1951 0 0
KI270303.1 1942 0 0
KI270381.1 1930 0 0
KI270529.1 1899 0 0
KI270425.1 1884 0 0
KI270396.1 1880 0 0
KI270363.1 1803 6 0
KI270386.1 1788 0 0
KI270465.1 1774 0 0
KI270383.1 1750 0 0
KI270384.1 1658 0 0
KI270330.1 1652 20 0
KI270372.1 1650 0 0
KI270548.1 1599 0 0
KI270580.1 1553 6 0
KI270387.1 1537 0 0
KI270391.1 1484 0 0
KI270305.1 1472 0 0
KI270373.1 1451 0 0
KI270422.1 1445 0 0
KI270316.1 1444 0 0
KI270340.1 1428 0 0
KI270338.1 1428 0 0
KI270583.1 1400 0 0
KI270334.1 1368 0 0
KI270429.1 1361 0 0
KI270393.1 1308 0 0
KI270516.1 1300 0 0
KI270389.1 1298 0 0
KI270466.1 1233 12 0
KI270388.1 1216 0 0
KI270544.1 1202 0 0
KI270310.1 1201 0 0
KI270412.1 1179 0 0
KI270395.1 1143 0 0
KI270376.1 1136 0 0
KI270337.1 1121 0 0
KI270335.1 1048 0 0
KI270378.1 1048 0 0
KI270379.1 1045 2 0
KI270329.1 1040 0 0
KI270419.1 1029 0 0
KI270336.1 1026 0 0
KI270312.1 998 0 0
KI270539.1 993 4 0
KI270385.1 990 0 0
KI270423.1 981 0 0
KI270392.1 971 0 0
KI270394.1 970 0 0

however I'm not entirely sure how to format the knownReferences file.

Thanks for the help,
Martin

Question about perfectG value and running on HPC cluster

Hello,

Thank you very much for providing this tool. I was able to run the program successfully on my samples, but I am confused about the perfectG value in the output. In the README doc, it is written:
perfectG: Perfect translation from HLA*PRG:LA call to G grop resolution? (1 = Yes)
This would imply that perfectG = 1 indicates no error, but then it is also written:
If perfectG != 0, you might want to check ../working/$mySampleID/hla/R1_bestguess.txt, which contains the un-translated output from the internal model of the algorithm.
Is this maybe a typo? And how does the R1_bestguess.txt file help in the case where the translation is not perfect?

In addition, I just wanted to let you know that this tool isn't fully compatible with a cluster environment. HLA-LA requires indexing of the graph, but since most users on a cluster can't modify any program files, this can cause some issues. I was able to circumvent this problem by having the HPC cluster administrator grant me write permission to the the graphs directory, but it would be great if the graph directory wasn't hardcoded in the script. Thank you in advance for your help!

Only predict class I MHC alleles

Hi,

I am wondering if adding an option of specifying only class I or II alleles to predict at a time will reduce time/resource usage. If it does, I can perform HLA-PRG-LA on a resource-tensive machine with only 32GB memory.

Thanks.

Why my NA12878 test result is not same with NA12878_example_output_G.txt?

This is my command.
~/HLA-LA/src/HLA-LA.pl --BAM NA12878.mini.cram --graph PRG_MHC_GRCh38_withIMGT --sampleID NA12878 --maxThreads 40

This is my test result.
R1_bestguess_G.txt

Installation via conda fails

Hello,

I have attempted to install HLA-LA via conda by running "conda install -c bioconda hla-la". The installation fails with the following output:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: \
Found conflicts! Looking for incompatible packages.                                                                                  failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

As you can see, there is frustratingly no error message describing which packages are incompatible. Regardless, this is in a fresh conda environment just for this package, so I'm not sure what would be triggering the incompatibility. Any idea what might be going on here? This is using conda v4.8.3.

Need a bit more detail in README.md under Test run section

I'm trying to install HLA-PRG-LA on the NIH Biowulf cluster. I'd like to complete the steps in the 'Test run' section. Can you please include the actual command(s) needed to run HLA-PRG-LA on the NA12878.cram dataset?

I'm also very happy that you have provided a dataset for testing and validation. I wish more developers would do so. But the dataset is very large and takes a long time to download. Is it possible to create a smaller dataset for testing? If it's small enough, perhaps it could be packaged as part of the installation so that a separate download isn't necessary.

Thank you.

bamtools installation conflicts with HLA-LA makefile settings

The makefile settings doesn't seem to be fully compatible with bamtools (version tested 2.5.1). I installed my bamtools not in the build directory but in a different directory (defined in DCMAKE_INSTALL_PREFIX). If the a user installs it this way (see example below), then where is no src in the installed location and also it seems with version 2.5.1, the lib location is not named lib64 as suggested in the HLA-LA makefile. Note I'm aware there is also the commented out codes in makefile for the older bamtools version as well, but still either way it's not fully compatible, and needs tweaking...

Methods used for bamtools installation, which doesn't quite work with the makefile defined by HLA-LA

git clone --branch v2.5.1 git://github.com/pezmaster31/bamtools.git
cd bamtools
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=$BAMTOOLS_PATH ..
make
make install

About memory used

Hi,

my paired-end fastq file:
R1.fastq (250 Million reads, 150bp, ~1.2 GB)
R2.fastq (250 Million reads, 150bp, ~1.2 GB)

run HLA-LA will used about 300~400 GB RAM and ~90GB swap
Have any way to reduce RAM used?

my command is:

HLA-LA.pl --BAM Bwa-mem2/HLA_PM9-0232_S7.sorted.bam --graph PRG_MHC_GRCh38_withIMGT --workingDir . --sampleID HLA_LA --maxThreads 50

total time spend about 24HR~36HR

And another question is how can only typing some gene?
because I just need class I and II gene.

thanks!

yubau

Installing HLA*LA with bug

/usr/bin/ld: BFD version 2.20.51.0.2-5.42.el6 20100205 internal error, aborting at reloc.c line 443 in bfd_get_reloc_size

/usr/bin/ld: Please report this bug.

collect2: error: ld returned 1 exit status
make: *** [makefile:109: HLA-LA] Error 1

Graph path for Conda install

Hi, I have tried HLA:LA recently, I installed it through bioconda, however, I haven't seen any instruction at the end of the installation.

I downloaded the PRG_MHC_GRCh38_withIMGT graph manually, unzip them into the path ~/anaconda3/opt/hla-la/src/graphs/ (the full path is ~/anaconda3/opt/hla-la/src/graphs/PRG_MHC_GRCh38_withIMGT). However, when I run the command, it still gives error:


HLA-LA.pl --BAM $file --sampleID $sample_name --graph PRG_MHC_GRCh38_withIMGT/ --maxThreads 10 --workingDir ~/test_hlala
HLA-LA.pl

Identified paths:
        samtools_bin: ~/anaconda3/bin/samtools
        bwa_bin: ~/anaconda3/bin/bwa
        java_bin: ~/anaconda3/bin/java
        picard_sam2fastq_bin: ~/anaconda3/bin/picard
        General working directory: ~/test_hlala
        Sample-specific working directory: ~test_hlala/0411_01_01

Graph directory ~/opt/hla-la/src/../graphs/PRG_MHC_GRCh38_withIMGT/ not found - valid graph names are subdirectories of the graphs directory in the HLA-LA root at ~/anaconda3/bin/HLA-LA.pl line 198.

I checked the directory:


file "/home/nguyen/anaconda3/opt/hla-la/src/Graph/PRG_MHC_GRCh38_withIMGT/"
/home/nguyen/anaconda3/opt/hla-la/src/Graph/PRG_MHC_GRCh38_withIMGT/: directory

and the files inside

ls
extendedReferenceGenome  knownReferences  mapping  mapping_PRGonly  PRG  referenceGenomeSimulations  sampledReferenceGenomes  sequences.txt  translation

What did go wrong in my case?

BWA error

Hi,

I have been trying to use HLA-LA for illumina standard WES data. Unfortunately, I'm running into a BWA error after the edge path generation:

[ Wed Jun 24 19:10:29 2020 ] extensionAligner::extensionAligner(..): Indexing of edge paths; have 16804 completed edge paths.

$PATH/HLALA/bin/bwa mem -t15 -M -a /$PATH/HLALA/opt/hla-la/src/../graphs/PRG_MHC_GRCh38_withIMGT/extendedReferenceGenome/extendedReferenceGenome.fa /scratch/leonb/exomes/mapped_grch38/typing/I19D040a04/R_1.fastq /scratch/leonb/exomes/mapped_grch38/typing/I19D040a04/R_2.fastq | $PATH/HLALA/bin/samtools view -Sb - > /scratch/leonb/exomes/mapped_grch38/typing/I19D040a04/remapped_with_a.bam.unsorted
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 776266 sequences (110622870 bp)...
[E::bns_fetch_seq] begin=4104638566, mid=4104638567, end=3150488291, len=954150275, seq=0x7ff48c834010, rid=1141, far_beg=3150476765, far_end=3150488291
bwa: bntseq.c:444: bns_fetch_seq: Assertion `seq && *end - *beg == len' failed.

$PATH/HLALA/bin/samtools sort -o /scratch/leonb/exomes/mapped_grch38/typing/I19D040a04/remapped_with_a.bam /scratch/leonb/exomes/mapped_grch38/typing/I19D040a04/remapped_with_a.bam.unsorted
$PATH/HLALA/bin/samtools index /scratch/leonb/exomes/mapped_grch38/typing/I19D040a04/remapped_with_a.bam
 [ Wed Jun 24 19:10:56 2020 ] Remapping done.

processBAM::initBAM(..): Will examine 914 PRG-mapped intervals.
processBAM::extractSeeds(): examined 0 reads, transformed into 0 seeds.
processBAM::estimateInsertSize(): Deal 0 total proto-seeds (i.e. read pairs), of which 0 are incomplete.
HLA-LA: mapper/processBAM.cpp:1043: std::pair<double, double> mapper::processBAM::calculateInsertSizeFromHistogram(const std::map<int, double>&, bool): Assertion `set_weighted_80 && set_weighted_20 && set_median' failed.

HLA-LA execution not successful. Command was ../bin/HLA-LA --action HLA --maxThreads 15 --sampleID I19D040a04 --outputDirectory /scratch/leonb/exomes/mapped_grch38/typing/I19D040a04 --PRG_graph_dir /$PATH/HLALA/opt/hla-la/src/../graphs/PRG_MHC_GRCh38_withIMGT --FASTQU /scratch/leonb/exomes/mapped_grch38/typing/I19D040a04/R_U.fastq.splitLongReads --FASTQ1 /scratch/leonb/exomes/mapped_grch38/typing/I19D040a04/R_1.fastq --FASTQ2 /scratch/leonb/exomes/mapped_grch38/typing/I19D040a04/R_2.fastq --bwa_bin $PATH/HLALA/bin/bwa --samtools_bin $PATH/HLALA/bin/samtools --mapAgainstCompleteGenome 1 --longReads 0

Thanks and best,
Leon

when I Compilation I get Error

installation prereq

Hi I am trying to install HLA_LA prereq boost. I downloaded boost here: https://www.boost.org/users/download/. But it doesnt seem to contain the folders "include" and "lib" that seem required from the makefile of HLA-LA.
Any help would be appreciated!
Thanks!

Indexing memory usage and other things

Hello!
I am running the indexing process inside the Docker-container on a server with 30 Gb RAM + 8 Gb swap and 8 CPUs.
In first 15 minutes of indexing RAM usage grew up to all RAM (29.3 of 29.5 Gb) and to 4 Gb of swap.

I think it would be better to mention it somewhere in prerequests.
Also the process took only 1 CPU of 8. Is it possible to use more?
As I understand, the indexing process generates 2 files (9.2 Gb):

-rw-r--r--  1 root  502 5491768320 Jun 19 17:41 serializedGRAPH
-rw-r--r--  1 root  502 4197127944 Jun 19 17:41 serializedGRAPH_preGapPathIndex

Does the program need all files from source PRG_MHC_GRCh38_withIMGT.tar.gz or only new 2 files?

All process took 2 hours and 20 minutes.
4. I made a second run and got a bit different files:

-rw-r--r--  1 root  502 5491768268 Jun 20 07:17 serializedGRAPH
-rw-r--r--  1 root  502 4197127912 Jun 20 07:17 serializedGRAPH_preGapPathIndex

How do I have to understand, that indexing is done correctly?

Custom path to indexed graph

Hello!
I think it would be more comfortable if user could specify full path to graph dir. It looks like the main place to be changed is:

my $full_graph_dir = $FindBin::RealBin . '/../graphs/' . $graph;

diltheylab / hla-la Goto Github PK

hla-la's Introduction

HLA*LA (formerly HLA*PRG:LA)

News

Basics

Installing HLA*LA

Via bioconda

Manual installation

Prerequisites

Compilation

Debugging issues with bamtools and boost

Download the data package

Modifying paths.ini

Index the graph

Testing your installation

Running HLA*LA

Cluster installation

CRAM files

Long reads

Assembly typing

Interpreting the output from HLA*PRG:LA

HLA-DRB orthologs

Adding further references

Citing HLA*LA

hla-la's People

Contributors

Stargazers

Watchers

Forkers

hla-la's Issues

Recommend Projects

Recommend Topics

Recommend Org

HLALA (formerly HLAPRG:LA)