Yleaf software for human Y-chromosomal haplogroup inference from next generation sequencing data

License: GNU General Public License v3.0


yleaf's Introduction

Yleaf: software for human Y-chromosomal haplogroup inference from next generation sequencing data

Arwin Ralf, Diego Montiel Gonzalez, Kaiyin Zhong and Manfred Kayser

Department of Genetic Identification

Erasmus MC University Medical Centre Rotterdam, The Netherlands

Requirements

Operating system: Linux only.
Internet connection: required on the first run to download the reference genome. Alternatively, you can configure your own references.
Data storage: for installation we recommend more than 8 GB of free storage.

Installation

The easiest way to get Yleaf up and running is by using a conda environment.

# first clone this repository to get the environment_yleaf.yaml
git clone https://github.com/genid/Yleaf.git
cd Yleaf
# create the conda environment from the .yaml file; the environment will be called yleaf
conda env create --file environment_yleaf.yaml
# activate the environment
conda activate yleaf
# pip install the cloned yleaf into your environment. Using the -e flag allows you to modify the config file in your cloned folder
pip install -e .

# verify that Yleaf is installed correctly. You can call this command from any directory on your system
Yleaf -h 

Alternatively, install everything manually:

# install python and libraries
apt-get install python3.6
pip3 install pandas
pip3 install numpy
# install minimap2 for aligning FASTQ files
sudo apt-get install minimap2
# install SAMtools
wget https://github.com/samtools/samtools/releases/download/1.4.1/samtools-1.4.1.tar.bz2 -O samtools.tar.bz2
tar -xjvf samtools.tar.bz2
cd samtools-1.4.1/
./configure
make
make install
# clone the yleaf repository
git clone https://github.com/genid/Yleaf.git
# pip install the yleaf repository
cd Yleaf
pip install -e .

# verify that Yleaf is installed correctly. You can call this command from any directory on your system
Yleaf -h 

After installation you can open the yleaf/config.txt file and add custom paths for the files listed there. Yleaf will then either use the files you point to instead of downloading them on the first run, or download the files to the location you provide. This allows you to use a custom reference if you want. Please keep in mind that custom reference files might cause other issues or conflict with already existing data files. Positions are based on either hg38 or hg19.

Usage and examples

Below are some minimal working examples of how to use Yleaf with different input files. Additional options let you tune how strict Yleaf is, report private mutations, and draw a graph showing the position of the predicted haplogroups of all your samples in the haplogroup tree.

Note: In version 3.0 we switched to using YFull (v10.01) for the underlying tree structure of the haplogroups. This also means that predictions are a bit different compared to earlier versions.

Yleaf: FASTQ (raw reads)

Yleaf -fastq raw_reads.fastq -o fastq_output --reference_genome hg38

Yleaf: BAM or CRAM format

Yleaf -bam file.bam -o bam_output --reference_genome hg19 
Yleaf -cram file.cram -o cram_output --reference_genome hg38

Drawing predicted haplogroups in a tree and showing all private mutations

Yleaf -bam file.bam -o bam_output --reference_genome hg19 -dh -p

Additional information

For a more comprehensive manual please have a look at the yleaf_manual.

If you have a bug to report or a question about installation, consider sending an email to a.ralf at erasmusmc.nl or creating an issue on GitHub.

References and Supporting Information

A. Ralf, et al., Yleaf: software for human Y-chromosomal haplogroup inference from next generation sequencing data (2018).

https://academic.oup.com/mbe/article/35/5/1291/4922696

yleaf's People

Contributors

6bass6, bramvanwersch, cascadingstyletrees, dionzand, dmontielg, stikus, teepean

yleaf's Issues

Add Yleaf to PyPI/bioconda

Are there plans to add Yleaf to PyPI and/or bioconda? (I did not find it on either. Sorry if it is already there.)

The existing conda environment file is already great for standalone running, but including Yleaf into PyPI (and from there bioconda) would get you a docker/singularity container for each release without any extra effort, via biocontainers. That in turn would make it possible to include Yleaf in reproducible workflows and pipelines.

Add T2T positions?

Hello!

Would it be possible to add T2T positions to Yleaf and support the T2T-CHM13v2.0 reference?

Thanks!

Updating position files

Hi,

I have been updating hg19 position files and came across a problem I am not sure how to deal with. For example Z14 is represented in ISOGG as follows and Yleaf cannot parse that line.

Z14 R1b1a1b1a1a1b1 17364720..17364737 15252840..15252857 CAGATAGATAGATAGATA->CAGATAGATAGATA

Thanks!
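
A line like the Z14 entry above differs from a single-base marker in two ways: each position is a range written as start..end, and the mutation is a multi-base change written as ancestral->derived. The following is a minimal sketch of how such a line could be split into fields. It assumes the five whitespace-separated columns shown in the example; the names MarkerRecord, parse_position and parse_marker_line are hypothetical and are not part of Yleaf, and the sketch does not say which coordinate column belongs to which genome build.

```python
from typing import NamedTuple, Tuple


class MarkerRecord(NamedTuple):
    """Hypothetical container for one marker line (not a Yleaf type)."""
    name: str
    haplogroup: str
    first_range: Tuple[int, int]
    second_range: Tuple[int, int]
    ancestral: str
    derived: str


def parse_position(field: str) -> Tuple[int, int]:
    """Parse '123' or '123..456' into an inclusive (start, end) pair."""
    start, _, end = field.partition("..")
    return int(start), int(end or start)


def parse_marker_line(line: str) -> MarkerRecord:
    """Split a five-column marker line into typed fields."""
    name, haplogroup, pos_a, pos_b, mutation = line.split()
    ancestral, derived = mutation.split("->")
    return MarkerRecord(
        name,
        haplogroup,
        parse_position(pos_a),
        parse_position(pos_b),
        ancestral,
        derived,
    )


record = parse_marker_line(
    "Z14 R1b1a1b1a1a1b1 17364720..17364737 15252840..15252857 "
    "CAGATAGATAGATAGATA->CAGATAGATAGATA"
)
```

A single-position field such as 12345 yields (12345, 12345), so SNP lines and range lines can share one code path.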

Difference in results between Yleaf v3.0.1 and v3.1 (master)

Hello, after migration from Yleaf v3.0.1 to latest version (not using v3.1 due to #15 and #16 in v3.1 release) we've got some differences:

Yleaf_version | Sample_name | Hg | Hg_marker | Total_reads | Valid_markers | QC-score | QC-1 | QC-2 | QC-3
latest | test | N1~ | N-CTS11499/etc*(xCTS10760,Z4963,B195,Y13851,F859,PF967.2,Y16325,M2118,Y9025,B187,Y24348,FGC10788,F1228,CTS5397) | 4124 | 64768 | 1.0 | 1.0 | 1.0 | 1.0
v3.0.1 | test | N1a1a1a1a2a1a1~ | N-Z1926*(xCTS1737,Y21699) | 5549131 | 64771 | 1.0 | 1.0 | 1.0 | 1.0

You can see two major differences - Hg (and Hg_marker) and Total_reads (and Valid_markers btw).

  • The first difference comes from the sorting order that changed in bc03016.

Is it intended that we now get N1~ instead of N1a1a1a1a2a1a1~?


  • The second difference comes from code refactoring:

Total_reads and Valid_markers are read from the second and last lines of the log:
https://github.com/genid/Yleaf/blob/master/yleaf/old_predict_haplogroup.py#L255-L287

def process_log(log_file):
    log_file += "info"
    total_reads = "NA"
    valid_markers = "NA"

    try:
        df_log = pd.read_csv(log_file, sep=":", header=None)
        log_array = df_log[1].values
        total_reads = log_array[1]
        valid_markers = log_array[-1]
    except FileNotFoundError:
        print("Warning: log file not found!")
    return total_reads, valid_markers


def main():
    print("\tY-Haplogroup Prediction")

    args = get_arguments()

    path_samples = args.Input  # .out files are collected
    samples = check_if_folder(path_samples, '.out')
    out_file = args.Outputfile
    hg_intermediate = str(yleaf_constants.DATA_FOLDER / yleaf_constants.HG_PREDICTION_FOLDER)
    intermediate_tree_table = hg_intermediate + "/Intermediates.txt"
    h_flag = True
    log_output = []
    for sample_name in samples:
        putative_hg = "NA"
        out_name = str(sample_name.split("/")[-1])
        out_name = out_name.split(".")[0]

        total_reads, valid_markers = process_log(sample_name[:-3])

In v3.0.1 this was correct, but now it is not.

Log for our data:

Total of mapped reads: 5549131
Total of unmapped reads: 4124
Valid markers: 64797
Markers with zero reads: 0
Markers below the read threshold {1}: 0
Markers below the base majority threshold {90}: 28
Markers with discordant genotype: 1
Markers without haplogroup information: 29
Markers with haplogroup information: 64768

Now we need the first and third lines, not the second and last.
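
Reading the values by label instead of by line position would make the parsing independent of the order of the log lines. The following is a minimal sketch, assuming the log keeps the "label: value" layout shown above and that the wanted fields are the ones on the first and third lines, as stated in this report. The names parse_log and read_counts are hypothetical and are not part of Yleaf.

```python
from typing import Dict, Tuple

# Sample log copied from the report above.
LOG_TEXT = """Total of mapped reads: 5549131
Total of unmapped reads: 4124
Valid markers: 64797
Markers with zero reads: 0
Markers below the read threshold {1}: 0
Markers below the base majority threshold {90}: 28
Markers with discordant genotype: 1
Markers without haplogroup information: 29
Markers with haplogroup information: 64768
"""


def parse_log(text: str) -> Dict[str, str]:
    """Map each 'label: value' line to an entry keyed by its label."""
    entries = {}
    for line in text.splitlines():
        label, separator, value = line.partition(":")
        if separator:  # skip lines that have no colon
            entries[label.strip()] = value.strip()
    return entries


def read_counts(text: str) -> Tuple[str, str]:
    """Return (total_reads, valid_markers), or 'NA' for a missing label."""
    entries = parse_log(text)
    return (
        entries.get("Total of mapped reads", "NA"),
        entries.get("Valid markers", "NA"),
    )


total_reads, valid_markers = read_counts(LOG_TEXT)
```

With this approach, adding or reordering log lines would not silently shift which value ends up in each output column.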

What if input paired-end reads fastq file

Hi, my fastq files are generated by paired-end sequencing, so that I have two fq files for each sample (e.g. sample.1.fq.gz and sample.2.fq.gz). Does Yleaf support two fq files input like -fastq sample.1.fastq -fastq sample.2.fastq?

Best wishes
Xb

2.3 release

Hello!

Thank you for the tool! Do you plan to release the 2.3 version?

Yleaf.py: error: argument -r/--Reads_thresh: invalid int value: 'ef'

Hi, do you know what might be the problem?

##################
mikael@mikael-HP-Z600-Workstation[Yleaf] python Yleaf.py -bam /media/hd01/data/genome_mikael/cleanreads/md.chrY.bam -ref hg38 -pos /usr/local/bioinf/Yleaf/hg38.txt -out /media/hd01/data/genome_mikael/cleanreads/ydna_out -r 1 -q 20 -b 90 -t 1
Erasmus MC Department of Genetic Identification

Yleaf: software tool for human Y-chromosomal 
phylogenetic analysis and haplogroup inference v2.1



       |
      /|\          
     /\|/\    
    \\\|///   
     \\|//  
      |||   
      |||    
      |||    

usage: Yleaf.py [-h] [-fastq PATH] [-bam PATH] [-f PATH] -pos PATH -out STRING
[-r READS_THRESH] -q QUALITY_THRESH -b BASE_MAJORITY
[-t THREADS]
Yleaf.py: error: argument -r/--Reads_thresh: invalid int value: 'ef'

usage: Yleaf.py [-h] [-fastq PATH] [-bam PATH] [-f PATH] -pos PATH -out STRING
[-r READS_THRESH] -q QUALITY_THRESH -b BASE_MAJORITY
[-t THREADS]
Yleaf.py: error: argument -r/--Reads_thresh: invalid int value: 'ef'
##############

No haplogroup result if I use the -r1 option

I'm using Yleaf with the options -r 2 and -r 1 to compare the two outputs, but I got stuck because of a strange result: the haplogroup prediction for -r 1 doesn't show any result, even though the -r 2 run gave haplogroup R1b1a1b1a.
I didn't expect a result like this, because the -r 1 option is less stringent and so should give a finer result than the -r 2 option.
Why is this happening? What has gone wrong?
Below are the commands I used for both steps of the -r 2 analysis:

Yleafv2.2/Yleaf.py -bam C-58_picard.bam -pos /Yleafv2.2/WGS_hg19_noChr.txt -out mysample_Y_r2_DNA_Yleaf -r 2 -q 20 -b 90 -t 6

/Yleafv2.2/predict_haplogroup.py -input mysample_Y_r2_DNA_Yleaf -out 58_r2_y.hg 2> hg.err

The log file contains this information: Set max per-file depth to 8000

Old Prediction Option not using the correct files

I created #15 as the first of a few changes that are needed to fix the old prediction option.

The next part should be to fix the intermediates.txt file look up since currently it's missing a /, so you'd see an error that the file wasn't found at .../data/hg_prediction_tablesIntermediates.txt

One option I found was to change Intermediates.txt to /Intermediates.txt at

intermediate_tree_table = hg_intermediate + "Intermediates.txt"

This however hasn't yet fixed all the issues that I'm having with old predictions.
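
The missing separator described above is typical of building paths by string concatenation. The following is a minimal sketch of the same lookup built with pathlib, which inserts the separator itself. DATA_FOLDER and HG_PREDICTION_FOLDER here are stand-in values chosen for illustration and are assumptions, not the real Yleaf constants.

```python
from pathlib import Path

# Stand-in values for illustration only (assumed, not Yleaf's constants).
DATA_FOLDER = Path("data")
HG_PREDICTION_FOLDER = "hg_prediction_tables"

# String concatenation drops the separator and glues the names together.
broken = str(DATA_FOLDER / HG_PREDICTION_FOLDER) + "Intermediates.txt"

# Joining with the / operator keeps every component separate.
fixed = DATA_FOLDER / HG_PREDICTION_FOLDER / "Intermediates.txt"
```

Keeping the value as a Path until the file is opened avoids having to remember a leading "/" in each file name.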

No such file or directory: 'Hg_Prediction_tables/tree.json'

Hi, I had installed and was using a previous version.

I updated to 3.0 and now I get the following error:
[mpileup] 1 samples in 1 input files
--- 0.20 seconds in run PileUp ---
Extracting haplogroups...
Traceback (most recent call last):
  File "/home/psonis/software/Yleaf/Yleaf.py", line 587, in <module>
    main()
  File "/home/psonis/software/Yleaf/Yleaf.py", line 561, in main
    output_file = samtools(args.threads, folder, folder_name, bam_file, args.Quality_thresh,
  File "/home/psonis/software/Yleaf/Yleaf.py", line 454, in samtools
    extract_haplogroups(markerfile, args.Reads_thresh, args.Base_majority,
  File "/home/psonis/software/Yleaf/Yleaf.py", line 413, in extract_haplogroups
    tree = Tree("Hg_Prediction_tables/tree.json")
  File "/home/psonis/software/Yleaf/tree.py", line 33, in __init__
    self._construct_tree(file)
  File "/home/psonis/software/Yleaf/tree.py", line 41, in _construct_tree
    with open(file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'Hg_Prediction_tables/tree.json'

Where is this file?

Also, I updated to 3.01 and everything is different. Not sure which is the executable, which are the proper files to use...
when testing python yleaf/Yleaf.py I get
Traceback (most recent call last):
  File "/home/psonis/software/Yleaf/yleaf/Yleaf.py", line 30, in <module>
    from yleaf import version
ModuleNotFoundError: No module named 'yleaf'
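
The first traceback above opens 'Hg_Prediction_tables/tree.json' as a relative path, and relative paths are resolved against the current working directory rather than against the script's location. One possible cause of the error is therefore running the script from a different directory. The following is a minimal sketch of anchoring a data file to a known base directory; resolve_data_file is a hypothetical helper and the base directory in the example is only illustrative.

```python
from pathlib import Path


def resolve_data_file(relative_path: str, base_dir: Path) -> Path:
    """Anchor a relative data path to base_dir instead of the working directory."""
    candidate = Path(relative_path)
    if candidate.is_absolute():
        return candidate
    return base_dir / candidate


# Inside a script, base_dir would typically be Path(__file__).resolve().parent.
tree_path = resolve_data_file(
    "Hg_Prediction_tables/tree.json", Path("/home/psonis/software/Yleaf")
)
```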

How to choose the position files?

Hi genid, I found there are three versions of the 'Position File' (MCS_Ampliseq/Visage_Ampliseq/WGS). What are the differences between them? How do I choose one? I use GRCh37 (1000 Genomes) as the reference and I don't know which one I can use. Thanks!

Article link in the readme is broken

Hello!

The link in the readme points to: https://academic.oup.com/mbe/article/35/7/1820/4993044, which is:

This is a correction to:
Molecular Biology and Evolution, Volume 35, Issue 5, May 2018, Pages 1291–1294, https://doi.org/10.1093/molbev/msy032
This article published with a comment intended only to the editors of the journal. The comment has been removed. The author regrets the error.

So the correct link is the original one?
https://academic.oup.com/mbe/article/35/5/1291/4922696

output

Hi, I want to infer the Y-haplogroups of 100 samples using their BAM files merged into one. When I run this tool (using the command: Yleaf -bam input.bam -o output -rg hg19), the .out file in the output folder contains a haplogroup per position, which is confusing. I don't know where to get the haplogroups per sample. Can you please help me? Thanks.

Error indicated when running Yleaf installed with conda on CentOs

After a conda install of Yleaf on Centos 7, I receive the following error message. This does not prevent the run but it indicates a problem:
Error processing line 1 of /home/grange/miniconda2/envs/yleaf/lib/python3.7/site-packages/distutils-precedence.pth:

Traceback (most recent call last):
  File "/home/grange/miniconda2/envs/yleaf/lib/python3.7/site.py", line 168, in addpackage
    exec(line)
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named '_distutils_hack'

Any suggestions of script modification or of why this module is not installed?

Thanks

Thierry

Few queries

Hi. Thanks for the nice software. I have 2 queries.

  1. Is there a way to ignore C->T and G->A SNPs in the tool's prediction? That will be invaluable for ancient DNA.
  2. Is there a way to run this on multiple bam files in one command?

Thank you.
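
For the second query, one workaround is to call Yleaf once per BAM file from a small wrapper script. The following is a minimal sketch that only uses the -bam, -o and --reference_genome options shown in the README examples. The helper names build_yleaf_commands and run_all are hypothetical, and the sketch makes no claim about whether Yleaf itself offers a built-in batch mode.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def build_yleaf_commands(
    bam_paths: Iterable[str],
    output_root: str,
    reference_genome: str = "hg19",
) -> List[List[str]]:
    """Build one Yleaf command per BAM file, each with its own output folder."""
    commands = []
    for bam in bam_paths:
        # Name the output folder after the BAM file, without its extension.
        output_dir = Path(output_root) / Path(bam).stem
        commands.append(
            ["Yleaf", "-bam", str(bam), "-o", str(output_dir),
             "--reference_genome", reference_genome]
        )
    return commands


def run_all(commands: List[List[str]]) -> None:
    """Run the commands one after another and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = build_yleaf_commands(["sample1.bam", "runs/sample2.bam"], "yleaf_out")
```

Calling run_all(commands) would then execute the runs sequentially, provided Yleaf is installed and available on the PATH.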

hg19

Hi,
is hg19 the one from UCSC or the equivalent of GRCh37, b37, h37 etc?
I have my samples mapped against hs37d5 (1000 Genomes project phase II). Is it proper for use with Yleaf?

Also when I am running yleaf 3.1 (conda installation as proposed in the manual) I get the following when the program starts

Error processing line 1 of /home/psonisns/miniconda3/envs/yleaf/lib/python3.7/site-packages/distutils-precedence.pth:

Traceback (most recent call last):
  File "/home/psonisns/miniconda3/envs/yleaf/lib/python3.7/site.py", line 168, in addpackage
    exec(line)
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named '_distutils_hack'

Remainder of file ignored

But then it continues with the analysis without any issues (I think) ...
