Giter Site home page Giter Site logo

pybsaseq's Introduction

Note: It is strongly recommended to have fisher installed on your system. It is fast in dealing with large datasets.

Citation: If you used PyBSASeq in your pulications, please cite:

  • Zhang, J., Panthee, D.R. PyBSASeq: a simple and effective algorithm for bulked segregant analysis with whole-genome sequencing data. BMC Bioinformatics 21, 99 (2020). https://doi.org/10.1186/s12859-020-3435-8
  • Zhang, J., Panthee, D.R. Next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes. G3 Genes|Genomes|Genetics, jkab400 (2021), https://doi.org/10.1093/g3journal/jkab400

PyBSASeq

A novel algorithm with high detection power for BSA-Seq data analysis - the significant structural variant method

Python 3.6 or above is required to run the script.

Usage

$ python PyBSASeq.py -i input -o output -p popstrct -b fbsize,sbsize

Here are the details of the options used in the script:

  • input – the names of the input files (the GATK4-generated tsv/csv file). If you have variant calling data from both the parents and the bulks, the format is as follows: parents.tsv,bulks.tsv. If you have only the variant calling data of the bulks, the format is as follows: bulks.tsv. The script PyBSASeq.py can handle both situations now. The script and the input files should be in the same folder.
  • output – the name of the output file. Default is BSASeq.csv
  • popstrct – population structure; three choices available: F2 for an F2 population, RIL for a population of recombinant inbred lines, or BC for a backcross population
  • fbsize – the number of individuals in the first bulk
  • sbsize – the number of individuals in the second bulk

The default cutoff p-value for identifying significant structural variants (sSV) from the SV dataset is 0.01 (alpha), and the default cutoff p-value for identifying sSVs from the simulated dataset is 0.01 (smalpha). These values can be changed using the following options:

-v alpha,smalpha

alpha and smalpha should be in the range of 0.0 – 1.0, the chosen value should make statistical sense. The greater the smalpha value, the higher the threshold and the lower the false positive rate.

The default size of the sliding window is 2000000 (base pairs) and the incremental step is 10000 (base pairs), and their values can be changed using the following option:

-s slidingWindowSize,incrementalStep

Four files (sliding_windows.csv, sv_fagz.csv, sv_fagz_fep.csv, and threshold.txt) and a folder in the date_time format containing BSASeq.csv and BSASeq.pdf will be generated in the ./Results folder after succesfully running the script. If gaps between subplots in PyBSASeq.pdf are too wide or too narrow, we can rerun the script to fine-tune the gaps by changing the values of a and/or b using the options below:

-a True -g a,b,c,d,e,f

  • a - the horizontal gap between subplots
  • b - the vertical gap between subplots
  • c, d, e, and f - the top, bottom, left, and right margins of the plot, respectively

The default values for a, b, c, d, e, and f are 0.028, 0.056, 0.0264, 0.054, 0.076, 0.002, 0.002, respectively. The script uses the sliding_windows.csv and threshold.txt files generated previously for plotting, and it is very fast.

If two or more peaks/valleys and all the values in between are beyond the confidence intervals/thresholds, only the highest peak or the lowerest valley will be identified as the peak/valley of this region. We can rerun the script to identify the positions of the other peaks/valleys and test their significance using the option below:

-e a1,b1,c1,a2,b2,c2,......,an,bn,cn

  • a - chromosome id
  • b - the start point of a chromosomal fragment
  • c - the end point of a chromosomal fragment

Right now, this option will not work if the chromosome IDs in the reference genome sequences are not digits, with the exception of sex chromosomes; we can use 1000 - 1005 to respectively represent sex chromosomes X, Y, Z, W, U, and V when specify regions on these chromosomes. Multiple regions on the same chromsome can be selected.

Workflow

  1. Structural variant (SV) filtering
  2. Perform Fisher's exact test using the AD values of each SV in both bulks. A SV would be identified as an sSV if its p-value is less than alpha. In the meantime, simulated REF/ALT reads of each SV is obtained via simulation under null hypothesis, and Fisher's exact test is also performed using these simulated AD values; for each SV, it would be an sSV if its p-value is less than smalpha. Identification of sSVs from the simulated dataset is for threshold calculation.
  3. Threshold calculation. The result is saved in the threshold.txt file. The threshold.txt file needs to be deleted if starting over is desired (e.g, if the size of the sliding window is changed).
  4. Plotting.

Test datasets

Two SV datasets for testing purpose, one from rice while the other from maize, are stored in the Data folder. The rice .csv files were generated via GATK4 using the sequencing data from the work of Lahari et al while the maize .csv files were extracted from .csv/.tsv files generated by Zheng et al.

Interpret the results

In the file PyBSASeq.pdf, peaks/valleys in panels B, C, and D that are beyond their genome-wide thresholds/confidence intervals very likely represent genomic regions controlling the trait of interest. The genomic region-specific thresholds/confidence intervals of these peaks/valleys are calculated to verrify their significance status. The results are presented in the file BSASeq.csv. A peak/valley is significant if its significance_status is equal to 1, not significant otherwise.

In BSASeq.csv, each row indicates a sliding window that contains a potentially significant peak/valley. The meanning of each column is shown below.

  • firstbulk_id.AvgLD: the average sequencing depth of the first bulk
  • secondbulk_id.AvgLD: the average sequencing depth of the second bulk
  • sSV: the number of the significant SVs in the sliding window
  • totalSV: the number of the total SVs in the sliding window
  • sSV/totalSV: the sSV/totalSV ratio in the sliding window
  • Threshold_sSV: the threshold of the sSV/totalSV ratio of the sliding window
  • GS: the G-statistic value of the sliding window
  • Threshold_GS: the threshold of the G-statistic value of the sliding window
  • DAF: the allele frequency difference of the sliding window
  • DAF_CI_LB: the DAF confidence interval lower bound of the sliding window
  • DAF_CI_UB: the DAF confidence interval upper bound of the sliding window
  • Threshold_DAF: the DAF threshold of the sliding window when parental genome sequences are not avaiable (or not used)
  • pvalue_tt: the t-test p-value of the sliding window
  • Significance_SSV: the significant status of the sliding window when using the significant SV method
  • Significance_GS: the significant status of the sliding window when using the G-statistic method
  • Significance_AF: the significant status of the sliding window when using the allele frequency method
  • Significance_TT: the significant status of the sliding window when using the t-test method

Bug report

Please report here if encounter any issue. I can receive issue notifications now.

Other methods for BSA-Seq data analysis

BSA-Seq data analysis can be done using either the SNP index (allele frequency) method or the G-statistic method as well. I implemented both methods in Python for the purpose of comparison: PySNPIndex and PyGStatistic. The Python implementation of the original G-statistic method by Magwene can be found here (just found this site today, 6/27/2019), and the R implementation of both methods by Mansfeld can be found here.

pybsaseq's People

Contributors

dblhlx avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

pybsaseq's Issues

Create exception for ZeroDivisionError

Hello!

We found some unexpected behaviour in the PyBSAseq script. When setting a window size that is too small, meaning that there are windows that do not contain any SNPs, PyBSAseq fails with the following error:

image

It seems like unexpected behaviour that the pipeline fails, when windows contain no SNPs.
Would it be possible for you to create an exception for this error and e.g. infer the value "0" for the ratio sSV / total SV in the effected windows?

Thanks a lot for your help in advance!

Kind regards,
Fleur

Plot explanation

Dear,

I ran PyBSASeq using default settings and was wondering about the output, especially the plots.
Please find attached the png file.
PyBSASeq

I have several questions:

  1. I see in the (a) row many of my raw variants (blue) get lost in filtering (black), is this as expected? The filtering effect does look similar to Figure 1a of the PyBSASeq publication.
  2. In the (b) row I see some very large peaks that mostly emphasize the signal of the variant rich regions after filtering from the (a) row, but the red threshold line does not seem to make a useful distinction. This is different compared to Figure 1b of the PyBSASeq publication where the red line seems to distinguish large peaks from small ones.
  3. The G-statistic panel (c) shows some very interesting peaks, but I do not understand the meaning of the pink and red lines. The red line seems to be near 0 and not useful. The pink one looks like a usable threshold to me.
  4. The delta Allele frequency panel (d) shows a red line that seems to be a threshold but for the pink line it is unclear what it means. A colleague of mine used PyBSASeq a few years ago and got a rather different plot more like Figure 3 of the publication that had also negative delta. Did something change in the software over time? I could not find a commit pointing to such a thing, only that there seemed to be a WoP program previously.

Furthermore, based on the different methods shown in the plot, the significant SNP method yield many more and also quite broad peak compared to the G-statistic. The delta AF method results in many more QTL as well. What would you advise be in this case? Focus on chr3 and ch6 or should all the other peaks also have the same priority?

Looking forward to your expert advise.

Kind regards,

Ronald

BSASeq.csv

The package is really good. If possible, please can you include a section in the README that lists what the headings are in the BSASeq.csv table.

highbulk.AvgLD
lowbulk.AvgLD
sSV
totalSV
sSV/totalSV
Threshold_sSV
GS
Threshold_GS
DAF
DAF_CI_LB
DAF_CI_UB
Threshold_DAF
pvalue_tt
Significance_SSV
Significance_GS
Significance_AF
Significance_TT

TypeError: '<' not supported between instances of 'str' and 'int'

Got an error when loaded GATK result with PyBSASeq BulksOnly script.

Perform SNP filtering Traceback (most recent call last): File "/data/mg1/chenhx/software/PyBSASeq/BulksOnly/PyBSASeq.py", line 993, in <module> bsaSNPs = snpFiltering(snpRawDF) File "/data/mg1/chenhx/software/PyBSASeq/BulksOnly/PyBSASeq.py", line 112, in snpFiltering df_lowq = df[(df[fb_GQ]<gqValue) | (df[sb_GQ]<gqValue)] File "/data/mg1/chenhx/miniconda3/lib/python3.6/site-packages/pandas/core/ops/common.py", line 65, in new_method return method(self, other) File "/data/mg1/chenhx/miniconda3/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 370, in wrapper res_values = comparison_op(lvalues, rvalues, op) File "/data/mg1/chenhx/miniconda3/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 244, in comparison_op res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues) File "/data/mg1/chenhx/miniconda3/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 56, in comp_method_OBJECT_ARRAY result = libops.scalar_compare(x.ravel(), y, op) File "pandas/_libs/ops.pyx", line 103, in pandas._libs.ops.scalar_compare TypeError: '<' not supported between instances of 'str' and 'int'

ValueError: Columns must be same length as key

Got an error when loaded GATK result with PyBSASeq BulksOnly script.

My GATK result was extracted from a joint call VCF with VariantsToTable, and converted to CSV format.

Perform SNP filtering Traceback (most recent call last): File "PyBSASeq.py", line 992, in <module> bsaSNPs = snpFiltering(snpRawDF) File "PyBSASeq.py", line 156, in snpFiltering df_2ALT_Real[['REF', 'ALT']] = df_2ALT_Real['ALT'].str.split(',', expand=True) File "/Users/haohu/miniconda3/envs/PyBSASeq/lib/python3.6/site-packages/pandas/core/frame.py", line 3041, in __setitem__ self._setitem_array(key, value) File "/Users/haohu/miniconda3/envs/PyBSASeq/lib/python3.6/site-packages/pandas/core/frame.py", line 3067, in _setitem_array raise ValueError("Columns must be same length as key") ValueError: Columns must be same length as key

Running PyBSAseq without interactive input

Hi,

first thanks for this great tool! It has been very helpful and easy to use in contrast to some other BSAseq tools out there :)
We would like to use PyBSAseq as part of a fully automated pipeline and the interactive input is hindering us a little bit. Would there be a possibility to switch that behavior off or make it optional?
In our case the data comes preprocessed so that we have only chromosomes we want to use in the input and the order is also correct.

Thanks!

does the order of parents/bulks matter?

Hi! It's me again! I finally have the variant call against the genome of one of the parents. If I understood well I need to add a parameter --parent parent_genome.fasta, right? Then my question is, does it matter the order of samples in the tsv file? Should the high/low order be the same in parent and bulks tsv? Thanks so much!

parents file

Hi!
Sorry for my dumb questions, but I'm really interested in using your program and I don't have experience with bulk segregation analysis. I understood that the parent genome is not needed in the current version of the program, but as I have, I suppose it would be better to use it. In this case, the parents.tsv is the SNP calling of both parents using one of the parents as a reference? Another question, I see the figure generated are "Number of SVs" and and not "Number of SNPs" as in the papers. Am I using the correct version of the program?

Thanks very much

Non-numeric contig IDs aren't tolerated

Hi,

In the process of running this software, I have run into the below error.

ValueError: invalid literal for int() with base 10: 'NC_052499.1'

I've changed the code as a workaround i.e.,

#peaklst = sorted(peaklst, key = lambda x: (int(x[0]), int(x[1])))
peaklst = sorted(peaklst, key = lambda x: (x[0], int(x[1])))

It might be nice if it supported non-numeric contig IDs by default though. Regardless, I thought I'd bring this to attention in case anyone else has a similar problem.

test files

Hi!
I want to test the program to understand how it works and what are the outputs. So, I'll use one of the files that you provided to test. I'm just wondering which are the values for these parameters: -p popstrct -b fbsize,sbsize . Thanks very much!

Chromosome size

My reference genome still has a lot of smaller contigs. We will make a new HI-C assembly and linkage map later in the year.
In my first run I used the defaults (default='2000000,10000') and chrs smaller than 2510000 bp were not included/ not suitable for analyses, can you recommend the smallest window and increment that will be useable for a deeper scan and not return an invalid lambda error?
Thank you.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.