deploid-dev / deploidpaper

Makefile 0.81% TeX 65.62% R 28.39% Shell 2.61% Python 2.57%

deploidpaper's Issues

post = likelihood \times prior

paper structure

Model
Inference method
Implementation details

simulation study for computing the errors

Number of SNPs

number	group
63348	africa 1
92780	africa 2
37758	africa 3
45621	africa 4
8070	asia 1
15761	asia 2
15144	asia 3

deconvolute elife asia/africa mixed samples

e-life samples to deconvolute (present in Pf3K)

(PF0077-C, PF0624-C, PF0651-C)

I used Africa panels 2, 4 and 4, respectively, to deconvolve the samples, together with the ten most diverse Asia samples. I only did this for chromosomes 1 and 14.

Two strains are present, but they seem difficult to separate. Red denotes painting probabilities from Africa; orange denotes Asia.

PF0624-C

PF0651-C

need to do multiple runs, and get a range for the proportions

compute the expected error rate from panel and the data

PG0390-C

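# Fraction of sites where the panel genotype is 1 but no alternative-allele reads were observed
length(which(panel$dd2gt.from.regression == 1 & coverage$altCount==0))/length(panel$dd2gt.from.regression)
[1] 0.007108239
# Fraction of sites where the panel genotype is 0 but alternative-allele reads were observed
length(which(panel$dd2gt.from.regression == 0 & coverage$altCount>0))/length(panel$dd2gt.from.regression)
[1] 0.01540118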

Comments to the Author
The authors have developed a tool, DEploid, to infer the number of strains in a mixed infection of Plasmodium falciparum, the relative proportion of each strain and their respective haplotypes. A tool like this would be extremely useful for analysis of malaria datasets and could greatly advance our knowledge of this disease.

The deconvolution method is another adaptation of the Li and Stephens model, this time applied to the deconvolution/phasing of SNPs in isolates with unknown number of component clones, with unknown contributions. This is the type of data that is typically encountered by malaria researchers and the authors make use of the MalariaGen Pf3K dataset to demonstrate its application. The paper comes with software: dePLOID.

This is a nice application of a well-established model for phasing. The paper is written in a very concise manner and requires some gap filling to become a bit more comprehensible. The results leave several questions open. While the authors do assess the performance of DEploid, general descriptions of the analysis process, data processing steps and model parameters are vague and/or lacking. The webpage describing DEploid is nicely done.

Several of the figures are too small. For example Figure 2 could do with an excerpt zooming in on a chromosome to show the results more clearly. Another alternative is to remove the black bars from the image to make the founder haplotypes more visible. Figure S2.2 is far too small. Since this is supplementary data there is no penalty to spread these figures out over several pages.
- We have updated all figures, and removed black bars from Figure 2 (tagged by REV1.1 in the main text and supplementary material).
Although the focus on the paper is on the deconvolution/phasing a lot of results are given to comparing proportion estimation, for which there are already several well performing methods available (although dePLOID beats their accuracy marginally, albeit at much greater CPU time cost). The paper would have benefitted from further evaluation of phasing performance, which is examined in detail with 1) a generated mixture experiment, and 2) a simulated data set, looking at coverage. A more semi-realistic study would use data from the largest Pf3K sub population, taking the approximately clonal samples and simulating some data from these with the proposed uniform recombination model to generate ‘truth’ as well as creating some mixtures from this data. This would provide an assessment in more realistic setting with variable coverage and data quality. Switch and genotyping errors could then be evaluated in this setting.
- We conducted simulation studies mimicking Pf3k samples in the main text (REV1.2.1) to investigate switch and genotype errors, and simulation studies to investigate the coverage requirement in the supplementary material (REV1.2.2).
Furthermore, it is not at all clear that this software will be of much use to most researchers working with WGS (or at the very least high throughput SNP genotyping data) since reference panels are crucial to this algorithm and there may not be sufficient reference samples at hand.
- Addressed by REV1.3.
Could the authors also comment in the paper on the potential use of in silico haplotypes derived from the read pairs themselves to help with the phasing here?
- Addressed by REV1.4.

2.2 Model:

You assume a prior in which the haplotypes of the n strains are independent of each other. What happens if the input sample contains related strains?
- This is an excellent point. We do not actually assume that the n strains within the same sample are independent of each other. The assumption is that the reference strains are independent of each other, but the deconvoluted strains can each copy from the same reference strain independently. In fact, the current implementation can deconvolute inbred samples; see the following example. However, the current method struggles with more complex inbreeding and with low-coverage data. We are improving the method with a new IBD component, which has already been implemented. Because it is a different method, we will discuss it in a new paper, where we will apply it to build a recombination map from field samples.

3.1 Accuracy:

How many SNPs were used in the analysis?
- We discuss this in the filtering step, addressed by REV1.6. We extracted 18,570 high-quality biallelic SNPs from the Pf3k data; after filtering, we use 17,530 for the experiment.
Why do you assume there are at most 3 strains present in the mixtures when the default value is 5 strains? Do your results differ when you assume 5 strains are present?
- Very good point. The inference is robust, with only subtle differences when assuming a different number of strains. We address this by REV1.7 in the updated Figure 2.
- How many reference strains were used for the analysis presented in Table 2 and what strains were these? Were they the baseline reference haplotypes for the four parent strains? Using the four parent strains should produce the best possible results which is unrealistic with field isolates.
- Addressed by REV1.8.
- Figure 2: It would be useful to include how many SNPs were included for analysis on chromosome 14 in the figure legend.
- 2369, fixed in the updated figure axis, see REV1.9

3.2 Comparison to existing methods:

- COIL uses genotype information. How did you generate the genotype data used here? Perhaps more information on data processing would be useful.
- Addressed by REV1.10 in the supplement.
- BEAGLE requires a reference dataset to infer haplotype phase, typically a large reference dataset. What reference dataset was used? Parameters used in this analysis, and the analysis of other software, would also be useful.
- Addressed by REV1.11.1 and REV1.11.2.
- Figure 3: pfmix infers the number of strains and their proportions, therefore please add the numbers of strains estimated by pfmix to Figure 3 panel (a) for comparison.
- Unfortunately, we couldn't get pfmix to work on the same dataset. With 4000 iterations, the method stopped with the following error:

 Error in ans[ok3] <- dbinom(x = x[ok3], size = size[ok3], prob = prob1,  : 
  NAs are not allowed in subscripted assignments
Calls: run.mcmc ... mh.mcmc -> calc.llk -> dbeta.binom.zi -> dbetabinom.ab
Execution halted

When we reduce the number of iterations, pfmix returns incorrect results. We feel this is unfair to pfmix; therefore, we modified the code to skip the model selection step, fix the number of strains, and infer the proportions only. Addressed by REV1.12.

- Figure 3: DEploid infers 6 samples as containing 3 strains when they really only contain 2 strains (as shown in Table 2). Why are these not represented in Figure 3 panel (a)?
- The overfitting of these six samples was due to markers with high frequencies for both reference and alternative alleles. It is fixed after the filtering step. We show this online at https://github.com/mcveanlab/DEploid/wiki/FAQ#data-filtering. We also explain another type of overfitting in our program, which can be adjusted by running the program with a different value for the parameter sigma; addressed by REV1.13 in the supplementary material.
- Figure 3: cannot read the figure legends and axes.
- Addressed by REV1.14.

Concerns:

- A typical reference panel would contain haplotypes from field samples constructed from the user. Therefore, one might expect results similar to Panel I in Figure 2. A reference panel like this does not seem to affect estimates of the number of strains or their relative proportions in an infection, however haplotype inference does not look flash. Perhaps address this in the discussion? If haplotype inference is not reliable then this tool is not terribly useful as other popular tools are available to estimate strain numbers and their relative proportions.
- Addressed by REV1.15.
- Was any filtering of poor quality SNPs performed? This would seem prudent for haplotype phasing.
- Addressed by REV1.16.
- Is the Gibbs update for the pair of haplotypes performed always in tandem with the single haplotype update?
- Addressed by REV1.17.

Supplementary material:

Figure S2.2: inconsistencies in WSAF in figures (a) and (b). Histogram of WSAF in panel (a) shows clustering around 0.3 and 1 while the distributions of WSAF across each chromosome in panel (b) cluster around 0.3 and 0?
- Addressed in the caption: we exclude points with WSAF exactly equal to 0 or 1.
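
The exclusion described in the response above can be sketched as follows. This is a minimal Python illustration, not the plotting code used for Figure S2.2: it assumes WSAF is the alternative-allele read count divided by the total read count at a marker, and the function names and example counts are hypothetical.

```python
def wsaf(ref_count, alt_count):
    """Within-sample allele frequency: alt reads / total reads (None if no reads)."""
    total = ref_count + alt_count
    return alt_count / total if total > 0 else None


def polymorphic_wsaf(ref_counts, alt_counts):
    """Keep only WSAF values strictly between 0 and 1, dropping uncovered markers."""
    values = (wsaf(r, a) for r, a in zip(ref_counts, alt_counts))
    return [v for v in values if v is not None and 0.0 < v < 1.0]


# Hypothetical read counts at five markers.
ref_counts = [10, 0, 7, 20, 0]
alt_counts = [0, 12, 3, 20, 0]
kept = polymorphic_wsaf(ref_counts, alt_counts)
```

With this filter, markers with no alternative reads (WSAF of 0) and markers with no reference reads (WSAF of 1) are both dropped, so a histogram and a per-chromosome scatter built from `kept` would be drawn from the same set of points.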

Minor comments:

Figure S2.2 ‘blue dots’
- Fixed by REV1.19
P6 O’Brien (2016)
- Fixed by REV1.20
P6 – BEAGLE would implicitly assume a 50:50 distribution of alleles with its diploid assumption.
- Fixed by REV1.21
P6. “ten most different” – different how? Define.
- We compute the pairwise differences between strains, and choose the ten strains that have the greatest distance. Addressed by REV1.22.
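
The response above does not state the exact selection rule, so the following Python sketch is only one plausible reading: it assumes the pairwise difference is the Hamming distance between haplotypes, and it uses a greedy farthest-point selection (start from the most distant pair, then repeatedly add the strain with the largest summed distance to those already chosen). The function names and the toy haplotypes are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def most_different(haplotypes, k):
    """Greedily pick k strain names that are far apart (an assumed rule, see lead-in)."""
    names = sorted(haplotypes)
    if k < 2 or k >= len(names):
        return names[:max(k, 0)]
    dist = {
        frozenset(pair): hamming(haplotypes[pair[0]], haplotypes[pair[1]])
        for pair in combinations(names, 2)
    }
    # Seed with the most distant pair of strains.
    chosen = list(max(combinations(names, 2), key=lambda p: dist[frozenset(p)]))
    while len(chosen) < k:
        remaining = [n for n in names if n not in chosen]
        # Add the strain with the largest total distance to the chosen set.
        best = max(remaining, key=lambda n: sum(dist[frozenset((n, c))] for c in chosen))
        chosen.append(best)
    return chosen


# Toy haplotypes (0 = reference allele, 1 = alternative allele).
toy = {
    "s1": [0, 0, 0, 0, 0, 0],
    "s2": [1, 1, 1, 1, 0, 0],
    "s3": [0, 0, 1, 1, 1, 1],
    "s4": [0, 0, 0, 0, 0, 1],
}
picked = most_different(toy, 3)
```

Other rules (for example, maximising the minimum pairwise distance) would also fit the description; the greedy sum is used here only because it is short to state.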

need to rerun painting for supplement

reviewer 3 comments

Reviewer: 3

Comments to the Author
Review of Zhi et al "Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data"

This paper describes how to infer the mixture decomposition of multiple strains of haploid organisms when multiple, related strains may be present in the same sample. This is an important problem in bacterial genetics, as argued by the authors, and they present a workable solution to this. The solution used, to use a copying model and perform markov-chain monte carlo analysis to extract out the appropriate details for the copying model, is an interesting novel application of these methods. To the best of my understanding it is correctly implemented and performs a useful job.

So I'm generally positive about this paper. I don't have major concerns, but as it stands it is not very easy to read. It is laid out in the classic mathematical style, which is to say to get to the results the reader has to slog through a lot of complex descriptions of mcmc updates, which have not been given any context or intuition. The writing is not bad but the ms would benefit hugely from a) a reorganisation to hide the gore from an interested biological-minded reader, and b) some effort to explain the details in intuitive terms. Some specific suggestions are listed below.
- I have moved the math part to the supplementary material.
More technically, I found the technical details to be slightly unsatisfactorily explored. Specific concerns were the arbitrary value of G=20 (page 4) which scales the recombination rate. This is pretty unconvincing. I agree that the model usually allows for some misspecification of the recombination rate but something much better could be done. Either do the right thing (inference of G by EM or analogously) or show that it is insensitive.
- In practice, we deconvolve over 1 million markers in field samples. We use a value of G = 20 to ensure small recombination probabilities between two markers, with a mean of 0.015. (Tagged by REV3.2)
I also disliked the anecdotal nature of Figure 2 - I was not clear what the general takehome message was meant to be, and the plot with its many black bars is quite confusing.
- Removed the black bars. This is similar to reviewer 1's comment, tagged by REV3.3.1 and REV3.3.2. The takehome message is that including more relevant strains in the reference panel improves the deconvolution result, with fewer switch errors and fewer genotype errors.

Minor comments:

Figure 3: c is a noisy plot. It would be much clearer if shown with a smoothing. It would inform the reader to say what the take home message of all plots should be in the legend.
- Fixed this in the updated figure, tagged by REV3.4.1 and REV3.4.2
Page 2 right: what is c? it isn't defined? In general the model section needs some effort in clarification.
- "c" reflects how much data is available. The average coverage for the data (at the markers we deconvolute) is above 100. Hence we set c = 100. We address this by REV3.5 in the main text
Page 2: sp: inversley
- Thanks for spotting this, addressed by REV3.6.
Page 3: titre: this is not a common term. What is wrong with concentration? I think this is what you mean anyway? I find no evidence that titre has this meaning in statistics, only in chemistry, though I appreciate that there are many fields I'm not familiar with.
- Yes and no. The log titre behaves similarly to the concentration parameter, with the same expression for the expectation. But it is strictly not the same as the Dirichlet distribution, which would result in a complicated form when computing the Hastings ratio for the Metropolis–Hastings algorithm, and the moves between x and x' are not symmetrical. We want to avoid confusion with the Dirichlet process, hence we do not call it the concentration parameter.
Page 4: "Such erroneous markers are not currently inferred by DEploid, though this could be included in future versions." If it is easy, do it. If it is not easy, don't offer. In my experience very few pieces of academic software are maintained and developed in this way.
- We apply the filtering step to exclude these markers, addressed by REV3.8, in main text and supplementary material. This software will be maintained and developed as part of the Pf3k project. As the project finishes, it will likely be maintained through the MalariaGEN network.

benchmark scripts

exactly what info was used for beagle, shapeit, pfmix, and COIL

wsaf of sim1 vs sim2

in short, panel is not perfect, the panel is created from the regression model

rm tmp files and replot

rm /well/mcvean/joezhu/pf3k/pf3k_5_1_final/dEploidOut/PG0*-C/PG0*-C_seed*k2.single2*
rm /well/mcvean/joezhu/pf3k/pf3k_5_1_final/dEploidOut/PG0*-C/PG0*-C_seed*k2.single3*
rm /well/mcvean/joezhu/pf3k/pf3k_5_1_final/dEploidOut/PG0*-C/PG0*-C_seed*k2.single4*

Reviewer 2 comment

Comments to the Author
The reconstruction of genomic sequences from mixed populations of pathogens from NGS data is highly relevant, as the most abundant haplotype not necessarily is most relevant or explains the infection phenotype. The ability to determine the multiplicity of infections, strain ratios and retrieving the haplotypes is highly wanted for surveillance and treatment of infectious diseases. Existing methods for this purpose are rare and limited, which results in high demand for new and better methods in this area.
I therefore support the publication of this manuscript in principle. It presents an interesting method and is applicable to data of a highly relevant pathogen. However, the limitation to Plasmodium falciparum is also a fundamental problem. In bioinformatics, methods need to be as generic as possible. It would be impracticable to develop specific methods for the genotyping of each individual pathogen. The manuscript and the bioinformatics community would very much benefit from additional data (based on simulation and real NGS reads), which indicates the performance of dEploid for other species (see below).

Major points

1. Experimental validation (Table 2). The mixed samples used are well known, therefore the choice of reference genomes for the reference panel and the samples for “PLAF” is obvious. How would that principle extend to unknown mixtures? The performance of the tool with “unknown” simulated datasets and a larger number of different strains used for the PLAF would be crucial to know. Also, because of the MCMC sampling, the percentages shown could vary when re-run with the same parameters. Instead of single values, distributions (e.g. means and variances) need to be shown.
1. We are releasing tons of data (Pf3k and the next data release, 7K samples), so building reference panels shouldn't be a big concern for the majority of cases (same for PLAF estimation).
2. The percentages shown could vary when re-run with the same parameters. This is a very good point. Yes, when rerunning the program, the proportion values do vary. We repeat the deconvolution 30 times, and show how the results vary when estimating the effective number of strains. Addressed by REV2.1.
2. Application to other species (Discussion). It is not clear how the concept can be applied to data of species from different biological domains like stated in the discussion: “bacterial or viral pathogens”. To use dEploid for other organisms the composition of the populations would be required to construct a reasonable PLAF matrix. In an attempt to apply dEploid to bacterial data with a PLAF and panel constructed from 26 reference genomes, we were able to retrieve the relative abundance of the most abundant sample in mixtures of up to 3 strains (min. 10%, max 80%) in most cases. The results (Fig. 1) were varying strongly when re-running the tool on the same dataset. The determination of multiplicity of infections and the haplotype reconstruction were not successful.
- In our experience, we use all available allele frequencies to compute the PLAF. Because P. falciparum is highly diverse among geographical regions, we compute the PLAF and build reference panels separately for seven geographical regions when analysing Pf3k field samples. For a different species and dataset, we suspect a more suitable filtering step should have been taken before deconvolution. However, this is difficult to anticipate in practice without any data exploration. In the supplement, we provide examples of how the filtering step works for our experiment, and hope this will inspire filtering steps for analyses of other organisms. We also show examples of adjusting the parameter sigma to improve the deconvolution of very imbalanced samples.
- In an attempt to deconvolute Plasmodium vivax samples (Pearson et al., 2016), we found DEploid works well for most samples. However, it struggles with samples that have both low coverage and high inbreeding. We have developed a new method accordingly, implemented with the "-ibd" flag. We are preparing another manuscript on the new method and its application.
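
The run-to-run variation mentioned in the response to major point 1 can be summarised along these lines. This Python sketch is illustrative only: it assumes the effective number of strains is the inverse of the sum of squared strain proportions, and the three example runs are made-up proportions rather than DEploid output.

```python
from statistics import mean, stdev


def effective_k(proportions):
    """Effective number of strains, assumed here to be 1 / sum(p_i^2)."""
    total = sum(proportions)
    normalised = [p / total for p in proportions]
    return 1.0 / sum(p * p for p in normalised)


def summarise_runs(runs):
    """Mean, standard deviation and range of effective K over repeated runs."""
    values = [effective_k(r) for r in runs]
    return {"mean": mean(values), "sd": stdev(values),
            "min": min(values), "max": max(values)}


# Hypothetical proportions from three repeated runs on the same sample.
runs = [[0.75, 0.25], [0.5, 0.5], [0.8, 0.2]]
summary = summarise_runs(runs)
```

Reporting the mean together with the spread over the repeated runs gives the distribution the reviewer asks for, instead of a single value from one MCMC chain.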

Minor points

1. Other sequencing technologies. As the error rate can be adjusted in dEploid, how well would the tool perform on data originating from different sequencing technologies (e.g. PacBio or Oxford Nanopore Technologies)?
- Thanks for this. We are in fact currently working with ONT data. We address this in the Discussion, tagged by REV2.3.
2. InDels and structural variants. When reconstructing haplotypes, indels and structural variation also need to be considered, while dEploid only reconstructs SNPs. This should be addressed in the discussion.
- We address this in the Discussion by REV2.4
