dblhlx / pybsaseq Goto Github PK

A novel algorithm with high detection power for BSA-Seq data analysis - the significant structural variant method

License: GNU General Public License v3.0

Python 100.00%

pybsaseq's Introduction

Note: It is strongly recommended to have fisher installed on your system. It is fast in dealing with large datasets.

Citation: If you used PyBSASeq in your pulications, please cite:

Zhang, J., Panthee, D.R. PyBSASeq: a simple and effective algorithm for bulked segregant analysis with whole-genome sequencing data. BMC Bioinformatics 21, 99 (2020). https://doi.org/10.1186/s12859-020-3435-8
Zhang, J., Panthee, D.R. Next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes. G3 Genes|Genomes|Genetics, jkab400 (2021), https://doi.org/10.1093/g3journal/jkab400

PyBSASeq

A novel algorithm with high detection power for BSA-Seq data analysis - the significant structural variant method

Python 3.6 or above is required to run the script.

Usage

$ python PyBSASeq.py -i input -o output -p popstrct -b fbsize,sbsize

Here are the details of the options used in the script:

input – the names of the input files (the GATK4-generated tsv/csv file). If you have variant calling data from both the parents and the bulks, the format is as follows: parents.tsv,bulks.tsv. If you have only the variant calling data of the bulks, the format is as follows: bulks.tsv. The script PyBSASeq.py can handle both situations now. The script and the input files should be in the same folder.
output – the name of the output file. Default is BSASeq.csv
popstrct – population structure; three choices available: F2 for an F₂ population, RIL for a population of recombinant inbred lines, or BC for a backcross population
fbsize – the number of individuals in the first bulk
sbsize – the number of individuals in the second bulk

The default cutoff p-value for identifying significant structural variants (sSV) from the SV dataset is 0.01 (alpha), and the default cutoff p-value for identifying sSVs from the simulated dataset is 0.01 (smalpha). These values can be changed using the following options:

-v alpha,smalpha

alpha and smalpha should be in the range of 0.0 – 1.0, the chosen value should make statistical sense. The greater the smalpha value, the higher the threshold and the lower the false positive rate.

The default size of the sliding window is 2000000 (base pairs) and the incremental step is 10000 (base pairs), and their values can be changed using the following option:

-s slidingWindowSize,incrementalStep

Four files (sliding_windows.csv, sv_fagz.csv, sv_fagz_fep.csv, and threshold.txt) and a folder in the date_time format containing BSASeq.csv and BSASeq.pdf will be generated in the ./Results folder after succesfully running the script. If gaps between subplots in PyBSASeq.pdf are too wide or too narrow, we can rerun the script to fine-tune the gaps by changing the values of a and/or b using the options below:

-a True -g a,b,c,d,e,f

a - the horizontal gap between subplots
b - the vertical gap between subplots
c, d, e, and f - the top, bottom, left, and right margins of the plot, respectively

The default values for a, b, c, d, e, and f are 0.028, 0.056, 0.0264, 0.054, 0.076, 0.002, 0.002, respectively. The script uses the sliding_windows.csv and threshold.txt files generated previously for plotting, and it is very fast.

If two or more peaks/valleys and all the values in between are beyond the confidence intervals/thresholds, only the highest peak or the lowerest valley will be identified as the peak/valley of this region. We can rerun the script to identify the positions of the other peaks/valleys and test their significance using the option below:

-e a1,b1,c1,a2,b2,c2,......,an,bn,cn

a - chromosome id
b - the start point of a chromosomal fragment
c - the end point of a chromosomal fragment

Right now, this option will not work if the chromosome IDs in the reference genome sequences are not digits, with the exception of sex chromosomes; we can use 1000 - 1005 to respectively represent sex chromosomes X, Y, Z, W, U, and V when specify regions on these chromosomes. Multiple regions on the same chromsome can be selected.

Workflow

Structural variant (SV) filtering
Perform Fisher's exact test using the AD values of each SV in both bulks. A SV would be identified as an sSV if its p-value is less than alpha. In the meantime, simulated REF/ALT reads of each SV is obtained via simulation under null hypothesis, and Fisher's exact test is also performed using these simulated AD values; for each SV, it would be an sSV if its p-value is less than smalpha. Identification of sSVs from the simulated dataset is for threshold calculation.
Threshold calculation. The result is saved in the threshold.txt file. The threshold.txt file needs to be deleted if starting over is desired (e.g, if the size of the sliding window is changed).
Plotting.

Test datasets

Two SV datasets for testing purpose, one from rice while the other from maize, are stored in the Data folder. The rice .csv files were generated via GATK4 using the sequencing data from the work of Lahari et al while the maize .csv files were extracted from .csv/.tsv files generated by Zheng et al.

Interpret the results

In the file PyBSASeq.pdf, peaks/valleys in panels B, C, and D that are beyond their genome-wide thresholds/confidence intervals very likely represent genomic regions controlling the trait of interest. The genomic region-specific thresholds/confidence intervals of these peaks/valleys are calculated to verrify their significance status. The results are presented in the file BSASeq.csv. A peak/valley is significant if its significance_status is equal to 1, not significant otherwise.

In BSASeq.csv, each row indicates a sliding window that contains a potentially significant peak/valley. The meanning of each column is shown below.

firstbulk_id.AvgLD: the average sequencing depth of the first bulk
secondbulk_id.AvgLD: the average sequencing depth of the second bulk
sSV: the number of the significant SVs in the sliding window
totalSV: the number of the total SVs in the sliding window
sSV/totalSV: the sSV/totalSV ratio in the sliding window
Threshold_sSV: the threshold of the sSV/totalSV ratio of the sliding window
GS: the G-statistic value of the sliding window
Threshold_GS: the threshold of the G-statistic value of the sliding window
DAF: the allele frequency difference of the sliding window
DAF_CI_LB: the DAF confidence interval lower bound of the sliding window
DAF_CI_UB: the DAF confidence interval upper bound of the sliding window
Threshold_DAF: the DAF threshold of the sliding window when parental genome sequences are not avaiable (or not used)
pvalue_tt: the t-test p-value of the sliding window
Significance_SSV: the significant status of the sliding window when using the significant SV method
Significance_GS: the significant status of the sliding window when using the G-statistic method
Significance_AF: the significant status of the sliding window when using the allele frequency method
Significance_TT: the significant status of the sliding window when using the t-test method

Bug report

Please report here if encounter any issue. I can receive issue notifications now.

Other methods for BSA-Seq data analysis

BSA-Seq data analysis can be done using either the SNP index (allele frequency) method or the G-statistic method as well. I implemented both methods in Python for the purpose of comparison: PySNPIndex and PyGStatistic. The Python implementation of the original G-statistic method by Magwene can be found here (just found this site today, 6/27/2019), and the R implementation of both methods by Mansfeld can be found here.

pybsaseq's People

Contributors

Stargazers

Watchers

Forkers

hpli9527 shulp2211 samhuairen omicsclass peirong777 zunairaafzal-hub wangpanqiao manigben sameermudbhari sultan-mia manoj34sapkota jiangchb

pybsaseq's Issues

Create exception for ZeroDivisionError

Hello!

We found some unexpected behaviour in the PyBSAseq script. When setting a window size that is too small, meaning that there are windows that do not contain any SNPs, PyBSAseq fails with the following error:

It seems like unexpected behaviour that the pipeline fails, when windows contain no SNPs.
Would it be possible for you to create an exception for this error and e.g. infer the value "0" for the ratio sSV / total SV in the effected windows?

Thanks a lot for your help in advance!

Kind regards,
Fleur

Please add license

Hi again,

I noticed that you have no license file in your repository. In your publication you have selected the MIT license:

If you need an example here is one from the freebayes variant caller repo (also MIT): https://github.com/freebayes/freebayes/blob/master/LICENSE

Plot explanation

Dear,

I ran PyBSASeq using default settings and was wondering about the output, especially the plots.
Please find attached the png file.

I have several questions:

I see in the (a) row many of my raw variants (blue) get lost in filtering (black), is this as expected? The filtering effect does look similar to Figure 1a of the PyBSASeq publication.
In the (b) row I see some very large peaks that mostly emphasize the signal of the variant rich regions after filtering from the (a) row, but the red threshold line does not seem to make a useful distinction. This is different compared to Figure 1b of the PyBSASeq publication where the red line seems to distinguish large peaks from small ones.
The G-statistic panel (c) shows some very interesting peaks, but I do not understand the meaning of the pink and red lines. The red line seems to be near 0 and not useful. The pink one looks like a usable threshold to me.
The delta Allele frequency panel (d) shows a red line that seems to be a threshold but for the pink line it is unclear what it means. A colleague of mine used PyBSASeq a few years ago and got a rather different plot more like Figure 3 of the publication that had also negative delta. Did something change in the software over time? I could not find a commit pointing to such a thing, only that there seemed to be a WoP program previously.

Furthermore, based on the different methods shown in the plot, the significant SNP method yield many more and also quite broad peak compared to the G-statistic. The delta AF method results in many more QTL as well. What would you advise be in this case? Focus on chr3 and ch6 or should all the other peaks also have the same priority?

Looking forward to your expert advise.

Kind regards,

Ronald

Help needed in error resolution

Hi I am using PyBSASeq lib for my research but continuously running into this error
eoor.pptx
error.docx

BSASeq.csv

The package is really good. If possible, please can you include a section in the README that lists what the headings are in the BSASeq.csv table.

highbulk.AvgLD
lowbulk.AvgLD
sSV
totalSV
sSV/totalSV
Threshold_sSV
GS
Threshold_GS
DAF
DAF_CI_LB
DAF_CI_UB
Threshold_DAF
pvalue_tt
Significance_SSV
Significance_GS
Significance_AF
Significance_TT

TypeError: '<' not supported between instances of 'str' and 'int'

Got an error when loaded GATK result with PyBSASeq BulksOnly script.

Perform SNP filtering Traceback (most recent call last): File "/data/mg1/chenhx/software/PyBSASeq/BulksOnly/PyBSASeq.py", line 993, in <module> bsaSNPs = snpFiltering(snpRawDF) File "/data/mg1/chenhx/software/PyBSASeq/BulksOnly/PyBSASeq.py", line 112, in snpFiltering df_lowq = df[(df[fb_GQ]<gqValue) | (df[sb_GQ]<gqValue)] File "/data/mg1/chenhx/miniconda3/lib/python3.6/site-packages/pandas/core/ops/common.py", line 65, in new_method return method(self, other) File "/data/mg1/chenhx/miniconda3/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 370, in wrapper res_values = comparison_op(lvalues, rvalues, op) File "/data/mg1/chenhx/miniconda3/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 244, in comparison_op res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues) File "/data/mg1/chenhx/miniconda3/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 56, in comp_method_OBJECT_ARRAY result = libops.scalar_compare(x.ravel(), y, op) File "pandas/_libs/ops.pyx", line 103, in pandas._libs.ops.scalar_compare TypeError: '<' not supported between instances of 'str' and 'int'

ValueError: Columns must be same length as key

Got an error when loaded GATK result with PyBSASeq BulksOnly script.

My GATK result was extracted from a joint call VCF with VariantsToTable, and converted to CSV format.

Perform SNP filtering Traceback (most recent call last): File "PyBSASeq.py", line 992, in <module> bsaSNPs = snpFiltering(snpRawDF) File "PyBSASeq.py", line 156, in snpFiltering df_2ALT_Real[['REF', 'ALT']] = df_2ALT_Real['ALT'].str.split(',', expand=True) File "/Users/haohu/miniconda3/envs/PyBSASeq/lib/python3.6/site-packages/pandas/core/frame.py", line 3041, in __setitem__ self._setitem_array(key, value) File "/Users/haohu/miniconda3/envs/PyBSASeq/lib/python3.6/site-packages/pandas/core/frame.py", line 3067, in _setitem_array raise ValueError("Columns must be same length as key") ValueError: Columns must be same length as key

name 'separator' is not defined

Running PyBSAseq without interactive input

Hi,

first thanks for this great tool! It has been very helpful and easy to use in contrast to some other BSAseq tools out there :)
We would like to use PyBSAseq as part of a fully automated pipeline and the interactive input is hindering us a little bit. Would there be a possibility to switch that behavior off or make it optional?
In our case the data comes preprocessed so that we have only chromosomes we want to use in the input and the order is also correct.

Thanks!

does the order of parents/bulks matter?

Hi! It's me again! I finally have the variant call against the genome of one of the parents. If I understood well I need to add a parameter --parent parent_genome.fasta, right? Then my question is, does it matter the order of samples in the tsv file? Should the high/low order be the same in parent and bulks tsv? Thanks so much!

parents file

Hi!
Sorry for my dumb questions, but I'm really interested in using your program and I don't have experience with bulk segregation analysis. I understood that the parent genome is not needed in the current version of the program, but as I have, I suppose it would be better to use it. In this case, the parents.tsv is the SNP calling of both parents using one of the parents as a reference? Another question, I see the figure generated are "Number of SVs" and and not "Number of SNPs" as in the papers. Am I using the correct version of the program?

Thanks very much

Non-numeric contig IDs aren't tolerated

Hi,

In the process of running this software, I have run into the below error.

ValueError: invalid literal for int() with base 10: 'NC_052499.1'

I've changed the code as a workaround i.e.,

#peaklst = sorted(peaklst, key = lambda x: (int(x[0]), int(x[1])))
peaklst = sorted(peaklst, key = lambda x: (x[0], int(x[1])))

It might be nice if it supported non-numeric contig IDs by default though. Regardless, I thought I'd bring this to attention in case anyone else has a similar problem.

test files

Hi!
I want to test the program to understand how it works and what are the outputs. So, I'll use one of the files that you provided to test. I'm just wondering which are the values for these parameters: -p popstrct -b fbsize,sbsize . Thanks very much!

Chromosome size

My reference genome still has a lot of smaller contigs. We will make a new HI-C assembly and linkage map later in the year.
In my first run I used the defaults (default='2000000,10000') and chrs smaller than 2510000 bp were not included/ not suitable for analyses, can you recommend the smallest window and increment that will be useable for a deeper scan and not return an invalid lambda error?
Thank you.