Comments (11)
No, the HiFi model doesn't support indel calling at the moment. I don't have any real HiFi cancer data, so I can't tell how the model performs on SNVs, let alone the more challenging somatic indels. If you can benchmark the HiFi model on your side, kindly let us know the performance. If SNV calling works, I will move on to indel calling.
from clairs.
We were not using the TVAF in the SEQC2 VCF for filtering. There are multiple TVAF values in the VCF, but instead of using them, we used the TVAF calculated in our own benchmarking datasets.
I've just updated the README, specifically the "Performance figures" section, to make it clearer (my previous writing might have caused your confusion).
For the Illumina dataset, now it shows
- 50-fold HCC1395 (tumor) and 40-fold HCC1395BL (normal) of Illumina NovaSeq 6000 data
- Truth: 39,447 high confidence (HighConf) and medium confidence (MedConf) SNV from the SEQC2 HCC1395/BL truths (Fang et al., 2021), the TVAF (tumor variant allele frequency) of which is ≥0.05 in the above dataset
For the ONT dataset, now it shows
- 70-fold HCC1395 (tumor) and 45-fold HCC1395BL (normal) of ONT R10.4.1 data
- Truth: 31,444 high confidence (HighConf) and medium confidence (MedConf) SNV from the SEQC2 HCC1395/BL truths (Fang et al., 2021), the TVAF (tumor variant allele frequency) of which is ≥0.05 in the above dataset
We think that using a platform-specific TVAF better reveals the performance of the algorithm itself across platforms. To do so, we used the command below:
pypy3 clairs.py compare_vcf \
--truth_vcf_fn high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz \
--input_vcf_fn output.vcf.gz \
--bed_fn High-Confidence_Regions_v1.2.bed \
--output_dir benchmark \
--input_filter_tag 'PASS' \
--normal_bam_fn $NORMAL_BAM \
--tumor_bam_fn $TUMOR_BAM \
--min_af 0.05 \
--threads 48 \
--output_best_f1_score \
--debug #Output all quality score cut-off
For 30x HiFi data, my wild guess is that the number of SNVs with TVAF≥0.05 will be somewhere around 35k, which is lower than 39k and will therefore bump up the recall rate.
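To illustrate the idea of a platform-specific TVAF filter, here is a minimal sketch. It assumes the alternate-allele and total read counts at each truth site have already been extracted from the tumor BAM. The names `tumor_vaf` and `filter_truth_by_tvaf` and the example counts are hypothetical, and this is not the actual `compare_vcf` implementation.

```python
from typing import Dict, List, Tuple

# (chrom, pos, ref, alt); a hypothetical key type used for illustration only.
Variant = Tuple[str, int, str, str]


def tumor_vaf(alt_reads: int, depth: int) -> float:
    """Tumor variant allele frequency: alt-supporting reads / total reads."""
    return alt_reads / depth if depth > 0 else 0.0


def filter_truth_by_tvaf(
    counts: Dict[Variant, Tuple[int, int]], min_af: float = 0.05
) -> List[Variant]:
    """Keep truth variants whose TVAF in *this* dataset is >= min_af."""
    return [
        variant
        for variant, (alt_reads, depth) in counts.items()
        if tumor_vaf(alt_reads, depth) >= min_af
    ]


# Made-up read counts: (alt_reads, depth) observed in the tumor BAM.
example_counts = {
    ("chr1", 100, "A", "T"): (12, 60),  # TVAF 0.20, kept
    ("chr1", 200, "G", "C"): (2, 70),   # TVAF ~0.03, dropped
    ("chr2", 300, "C", "T"): (0, 0),    # no coverage, dropped
}
kept = filter_truth_by_tvaf(example_counts, min_af=0.05)
print(kept)
```

Because the counts come from the dataset being benchmarked, the same truth VCF can yield a different number of truth records on each platform and at each coverage.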
from clairs.
"Could it be that at 60X there's a higher chance to see sequencing error and that's penalized significantly in the current statistical implementation?"
It is more likely because the HiFi training data we used to train the Sequel 2 model is just ~30-40x HG003 54x, HG004 52x. The model was not sufficiently trained on higher-coverage samples such as 60x, so it makes more mistakes there. ClairS supports tumor raw coverage up to ~80x, and will downsample the input on the fly to ~80x if that is exceeded. To solve the problem, we can either increase the training data coverage to 80x or above, or tune ClairS to downsample HiFi data to a lower coverage, say 30x. Obviously, the former produces better precision and recall.
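As a rough illustration of the second option, the fraction of reads to keep when downsampling is the target coverage divided by the current coverage. The sketch below is not ClairS's internal logic, and `downsample_fraction` is a hypothetical helper; the resulting fraction could be passed to a subsampling tool such as `samtools view -s`.

```python
def downsample_fraction(current_cov: float, target_cov: float) -> float:
    """Fraction of reads to keep so that coverage drops to about target_cov.

    Returns 1.0 when the data is already at or below the target.
    """
    if current_cov <= 0 or target_cov <= 0:
        raise ValueError("coverage values must be positive")
    return min(1.0, target_cov / current_cov)


print(downsample_fraction(60, 30))  # a 60x tumor brought down to ~30x
print(downsample_fraction(30, 80))  # already below the ~80x cap, keep everything
```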
from clairs.
An experimental model trained with Sequel II HiFi data is available at https://github.com/HKU-BAL/ClairS#pre-trained-models. You might want to give it a try.
from clairs.
Thanks @aquaskyline, I saw that model. I was just wondering whether it supports indel calls, since it's mentioned that indel calling is currently only supported for the ONT R10 model.
from clairs.
@aquaskyline Sure, I will let you know how it goes. Thank you.
from clairs.
One quick question: I noticed that for the ONT benchmark set there are 31,444 SNVs with TVAF ≥ 0.05, but when I took the truth set and filtered by TVAF there were 38,351 entries:
bcftools filter -i 'TVAF>=0.05' high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz | wc -l
38351
May I ask how you subset the variants to those 31,444 SNVs? Thank you.
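A side note on the count itself: `bcftools filter` writes the VCF header by default, so piping its output to `wc -l` counts header lines along with variant records (`bcftools view -H` suppresses the header). A minimal sketch of counting records only, where `count_vcf_records` is a hypothetical helper and the example lines are made up:

```python
from typing import Iterable


def count_vcf_records(lines: Iterable[str]) -> int:
    """Count variant records, skipping '#' header lines and blank lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


# Tiny made-up VCF: two header lines, two records.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\tTVAF=0.20\n",
    "chr1\t200\t.\tG\tC\t.\tPASS\tTVAF=0.07\n",
]
print(count_vcf_records(example_vcf))  # 2 records, while the line count is 4
```

For a compressed file, the lines could come from `gzip.open(path, "rt")`.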
from clairs.
Thanks @aquaskyline. While waiting for @zhengzhenxian to get back, I tried ClairS on a 30X tumor/normal HCC1395 HiFi dataset and saw the following results (with the full 39,447 SNVs):
python ~/softwares/ClairS/clairs.py compare_vcf \
--truth_vcf_fn high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz \
--input_vcf_fn output.vcf.gz \
--bed_fn High-Confidence_Regions_v1.2.bed \
--output_dir $(pwd)/hifi_benchmark \
--input_filter_tag 'PASS'
[INFO] Total input records: 26585, truth records: 39447, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9868 0.665 0.7946 26234 351 13213
Removing the "PASS" filter:
[INFO] Total input records: 29954, truth records: 39447, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9504 0.7217 0.8204 28468 1486 10979
It looks like ClairS is performing pretty well precision-wise, but a lot of false negatives are bringing recall down. I wonder whether there's a parameter we can tune, or whether this requires re-training (happy to collaborate with you).
I should also mention that increasing tumor coverage to 60X (normal at 40X) showed similar metrics, so I don't think it's a matter of coverage:
# With PASS filter
[INFO] Total input records: 26268, truth records: 39447, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9908 0.6598 0.7921 26027 241 13420
# Without PASS filter
[INFO] Total input records: 29522, truth records: 39447, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9625 0.7203 0.824 28415 1107 11032
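For reference, the Precision, Recall and F1-score columns above follow directly from the TP/FP/FN counts using the standard definitions. A minimal sketch, where `snv_metrics` is a hypothetical helper rather than part of ClairS:

```python
from typing import Tuple


def snv_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# 30X tumor/normal run with the PASS filter: TP=26234, FP=351, FN=13213.
precision, recall, f1 = snv_metrics(26234, 351, 13213)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.9868 0.6650 0.7946
```

Recall is TP divided by the number of truth records (TP + FN), so any change to the truth set, such as a TVAF cut-off, changes recall even when the calls themselves are identical.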
from clairs.
Here's the performance with the AF filter as per your instructions (I changed input_filter_tag to None just to see the performance at all cut-offs). It looks like at 60X there's a sharp increase in FN and a modest improvement in FP.
Could it be that at 60X there's a higher chance to see sequencing error and that's penalized significantly in the current statistical implementation? I think this is reasonable performance, though, as the absolute numbers of TP and FP look pretty good. And this is using the Sequel model on Revio data, so I think there will also be some improvement with a new model?
30X tumor, 30X normal:
[INFO] Total input records: 29950, truth records: 34457, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9505 0.826 0.8839 28463 1482 5994
SNV(Best F1) 0.9781 0.8215 0.8930 28305 633 6152 34457
5 0.9781 0.8215 0.8930 28305 633 6152 34457
4 0.9770 0.8220 0.8929 28325 666 6132 34457
3 0.9763 0.8223 0.8927 28334 689 6123 34457
2 0.9758 0.8224 0.8926 28338 702 6119 34457
1 0.9757 0.8224 0.8925 28338 707 6119 34457
6 0.9803 0.8160 0.8906 28118 566 6339 34457
0 0.9505 0.8260 0.8839 28463 1482 5994 34457
7 0.9838 0.7911 0.8770 27260 448 7197 34457
8 0.9868 0.7614 0.8596 26236 351 8221 34457
9 0.9895 0.7278 0.8387 25079 266 9378 34457
10 0.9923 0.6888 0.8131 23733 185 10724 34457
11 0.9945 0.6399 0.7787 22050 123 12407 34457
12 0.9969 0.5822 0.7351 20061 63 14396 34457
13 0.9983 0.5133 0.6780 17686 31 16771 34457
14 0.9988 0.4334 0.6045 14935 18 19522 34457
15 0.9993 0.3436 0.5114 11839 8 22618 34457
16 0.9998 0.2510 0.4012 8648 2 25809 34457
17 0.9996 0.1615 0.2780 5564 2 28893 34457
18 0.9997 0.0839 0.1549 2892 1 31565 34457
19 1.0000 0.0321 0.0623 1107 0 33350 34457
20 1.0000 0.0074 0.0147 256 0 34201 34457
21 1.0000 0.0007 0.0013 23 0 34434 34457
22 1.0000 0.0001 0.0001 2 0 34455 34457
60X tumor 41X normal:
[INFO] Total input records: 29516, truth records: 36713, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9627 0.7739 0.858 28413 1101 8300
SNV(Best F1) 0.9844 0.7700 0.8641 28268 448 8445 36713
5 0.9844 0.7700 0.8641 28268 448 8445 36713
4 0.9834 0.7703 0.8639 28280 478 8433 36713
3 0.9827 0.7705 0.8637 28286 499 8427 36713
2 0.9822 0.7705 0.8636 28287 512 8426 36713
6 0.9862 0.7667 0.8627 28148 395 8565 36713
0 0.9627 0.7739 0.8580 28413 1101 8300 36713
7 0.9886 0.7416 0.8474 27225 314 9488 36713
8 0.9908 0.7090 0.8266 26030 241 10683 36713
9 0.9928 0.6749 0.8036 24779 180 11934 36713
10 0.9940 0.6329 0.7734 23237 141 13476 36713
11 0.9960 0.5818 0.7346 21360 85 15353 36713
12 0.9973 0.5262 0.6889 19319 53 17394 36713
13 0.9980 0.4614 0.6310 16938 34 19775 36713
14 0.9988 0.3920 0.5630 14391 17 22322 36713
15 0.9990 0.3162 0.4804 11610 12 25103 36713
16 0.9998 0.2319 0.3764 8513 2 28200 36713
17 1.0000 0.1589 0.2742 5832 0 30881 36713
18 1.0000 0.0968 0.1766 3555 0 33158 36713
19 1.0000 0.0488 0.0930 1790 0 34923 36713
20 1.0000 0.0150 0.0295 550 0 36163 36713
21 1.0000 0.0015 0.0030 55 0 36658 36713
22 1.0000 0.0001 0.0002 4 0 36709 36713
23 1.0000 0.0000 0.0001 1 0 36712 36713
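The "Best F1" row in each table corresponds to the cut-off whose row has the highest F1-score. A minimal sketch of that selection, using three rows from the 30X table above. It assumes the first column of each row is the quality-score cut-off, `best_f1_cutoff` is a hypothetical helper, and this is not necessarily how `--output_best_f1_score` is implemented.

```python
from typing import List, Tuple

# (cutoff, TP, FP, FN); three rows copied from the 30X tumor / 30X normal table.
rows_30x: List[Tuple[int, int, int, int]] = [
    (0, 28463, 1482, 5994),
    (5, 28305, 633, 6152),
    (8, 26236, 351, 8221),
]


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 expressed directly in terms of TP, FP and FN."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_cutoff(rows: List[Tuple[int, int, int, int]]) -> Tuple[int, float]:
    """Return (cutoff, F1) for the row with the highest F1-score."""
    cutoff, tp, fp, fn = max(rows, key=lambda r: f1_score(r[1], r[2], r[3]))
    return cutoff, f1_score(tp, fp, fn)


best_cutoff, best_f1 = best_f1_cutoff(rows_30x)
print(best_cutoff, round(best_f1, 4))
```

Raising the cut-off trades recall for precision, so the best-F1 row marks the point where that trade stops paying off.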
from clairs.
@aquaskyline Thanks for your comment. That makes sense. I guess at this stage there's not much I can do from my end, but I'd be happy to help with data and benchmarking if that's of interest to your group. Feel free to email me directly at [email protected] moving forward. We're hoping to plug the gap in small-variant calling for HiFi cancer datasets as soon as possible :)
from clairs.