hku-bal / clairs Goto Github PK
View Code? Open in Web Editor NEWClairS - a deep-learning method for long-read somatic small variant calling
License: BSD 3-Clause "New" or "Revised" License
ClairS - a deep-learning method for long-read somatic small variant calling
License: BSD 3-Clause "New" or "Revised" License
Hello,
Could you make it available SEQC2 ONT v10.4.1 full bams (or even fastq) for both Tumor and Normal samples available to run ClairS?
Keyur
Hello ClairS Team,
Firstly, I'd like to express my gratitude for your outstanding research and for making the code available to the community.
In Illumina dataset, I am curious on what point ClairS provides advantage over other baselines such as VarNet and NeuSomatic. (workflow/design or synthetic dataset construction or larger or better model architecture)
Have you compared SNP/INDEL calling performance between ClairS and (VarNet or Neusomatic) when they are trained on identical datasets? I am curious whether the amount and quality of the training dataset have impacted the performance the most or not. (I saw the ablation study on your supplementary material regarding Nanopore dataset)
What is the key factor that improved the performance over Neusomatic and VarNet?
Thank you for your time and insights.
Best regards,
Quito418
Hi @aquaskyline and @zhengzhenxian,
Thank you for releasing an early access version of ClairS. Just playing around with this, I had a quick clarification question:
I was wondering what the QUAL column value signifies. Broadly, I believe the "QUAL value reflects how confident we are that a site displays some kind of variation considering the amount of data available", in this case, is it the confidence that a call is somatic? And how do you generate this QUAL score?
Thanks.
Hi Ruibang,
Since nanopore sequencing software updated to 5kHz sampling mode, I am wondering if there will be a pre-trained model for 5kHz data. Can I still use --platform ont_r10 (ont_r104_e81_sup_g5015) for now? Thank you.
Cheers
Jia
The command docker pull hkubal/clairs:latest was pulling the 1.4 version I had to use specific tag docker pull hkubal/clairs:v0.1.5
to get the latest version after I checked the release number
Hi,
ClairS keep stuck at this command after completing phasing for hours on ONT data. No output in 4_HAP_FILTER.log
log file either.
I am running second time, and seeing same issue.
[INFO] STEP 4: Haplotype filtering
[INFO] RUN THE FOLLOWING COMMAND:
( pypy3 /opt/bin/clairs.py haplotype_filtering --tumor_bam_fn /data/tahmad/devel/ClairS_HCC1437/tmp/clair3_output/phased_output/tumor_ --ref_fn /data/tahmad/alignment/grch38_chr.fasta --germline_vcf_fn /data/tahmad/devel/ClairS_HCC1437/tmp/clair3_output/clair3_tumor_output/merge_output.vcf.gz --pileup_vcf_fn /data/tahmad/devel/ClairS_HCC1437/tmp/vcf_output/pileup.vcf --full_alignment_vcf_fn /data/tahmad/devel/ClairS_HCC1437/tmp/vcf_output/full_alignment.vcf --output_dir /data/tahmad/devel/ClairS_HCC1437/tmp/vcf_output --samtools samtools --threads 56 ) 2>&1 | tee /data/tahmad/devel/ClairS_HCC1437/logs/4_HAP_FILTER.log
My clairS run command:
singularity run -B /data/tahmad /data/tahmad/images/clairs_latest.sif /opt/bin/run_clairs --threads 56 --phase_tumor True --longphase True --tumor_bam_fn /data/tahmad/devel/bams_haplotagged/HCC1437_ONT.bam --normal_bam_fn /data/tahmad/devel/bams_haplotagged/HCC1437BL_ONT.bam --ref /data/tahmad/alignment/grch38_chr.fasta --output_dir /data/tahmad/devel/ClairS_HCC1437 --platform ont_r10
I see following BAMs and phased VCFs:
ls -lh /data/tahmad/devel/ClairS_HCC1437/tmp/clair3_output/phased_output/tumor_
tumor_chr10.bam tumor_chr17.bam.bai tumor_chr3.bam tumor_phased_chr10.vcf.gz.tbi tumor_phased_chr18.vcf.gz tumor_phased_chr3.vcf.gz.tbi
tumor_chr10.bam.bai tumor_chr18.bam tumor_chr3.bam.bai tumor_phased_chr11.vcf.gz tumor_phased_chr18.vcf.gz.tbi tumor_phased_chr4.vcf.gz
tumor_chr11.bam tumor_chr18.bam.bai tumor_chr4.bam tumor_phased_chr11.vcf.gz.tbi tumor_phased_chr19.vcf.gz tumor_phased_chr4.vcf.gz.tbi
tumor_chr11.bam.bai tumor_chr19.bam tumor_chr4.bam.bai tumor_phased_chr12.vcf.gz tumor_phased_chr19.vcf.gz.tbi tumor_phased_chr5.vcf.gz
tumor_chr12.bam tumor_chr19.bam.bai tumor_chr5.bam tumor_phased_chr12.vcf.gz.tbi tumor_phased_chr1.vcf.gz tumor_phased_chr5.vcf.gz.tbi
tumor_chr12.bam.bai tumor_chr1.bam tumor_chr5.bam.bai tumor_phased_chr13.vcf.gz tumor_phased_chr1.vcf.gz.tbi tumor_phased_chr6.vcf.gz
tumor_chr13.bam tumor_chr1.bam.bai tumor_chr6.bam tumor_phased_chr13.vcf.gz.tbi tumor_phased_chr20.vcf.gz tumor_phased_chr6.vcf.gz.tbi
tumor_chr13.bam.bai tumor_chr20.bam tumor_chr6.bam.bai tumor_phased_chr14.vcf.gz tumor_phased_chr20.vcf.gz.tbi tumor_phased_chr7.vcf.gz
tumor_chr14.bam tumor_chr20.bam.bai tumor_chr7.bam tumor_phased_chr14.vcf.gz.tbi tumor_phased_chr21.vcf.gz tumor_phased_chr7.vcf.gz.tbi
tumor_chr14.bam.bai tumor_chr21.bam tumor_chr7.bam.bai tumor_phased_chr15.vcf.gz tumor_phased_chr21.vcf.gz.tbi tumor_phased_chr8.vcf.gz
tumor_chr15.bam tumor_chr21.bam.bai tumor_chr8.bam tumor_phased_chr15.vcf.gz.tbi tumor_phased_chr22.vcf.gz tumor_phased_chr8.vcf.gz.tbi
tumor_chr15.bam.bai tumor_chr22.bam tumor_chr8.bam.bai tumor_phased_chr16.vcf.gz tumor_phased_chr22.vcf.gz.tbi tumor_phased_chr9.vcf.gz
tumor_chr16.bam tumor_chr22.bam.bai tumor_chr9.bam tumor_phased_chr16.vcf.gz.tbi tumor_phased_chr2.vcf.gz tumor_phased_chr9.vcf.gz.tbi
tumor_chr16.bam.bai tumor_chr2.bam tumor_chr9.bam.bai tumor_phased_chr17.vcf.gz tumor_phased_chr2.vcf.gz.tbi
tumor_chr17.bam tumor_chr2.bam.bai tumor_phased_chr10.vcf.gz tumor_phased_chr17.vcf.gz.tbi tumor_phased_chr3.vcf.gz
Hello,
I come across an issue when the software is provided input files with spaces in their path.
$ ~/Software/ClairS/run_clairs --tumor_bam_fn Test\ data/demo_tumor.remap.bam --normal_bam_fn Test\ data/demo_tumor.remap.bam --bed_fn Test\ data/demo.bed --ref_fn Test\ data/GCA_000001405.15_GRCh38_no_alt_analysis_set_chr20.fna --threads 4 -o Test\ data/TESTING_BREAK/ClairS_test -p ont_r10_guppy
[INFO] Create folder Test data/TESTING_BREAK/ClairS_test
[COMMAND] /home/talentia/Software/ClairS/run_clairs --tumor_bam_fn Test data/demo_tumor.remap.bam --normal_bam_fn Test data/demo_tumor.remap.bam --ref_fn Test data/GCA_000001405.15_GRCh38_no_alt_analysis_set_chr20.fna --threads 4 --platform ont_r10_guppy --output_dir Test data/TESTING_BREAK/ClairS_test --bed_fn Test data/demo.bed
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/logs
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/split_beds
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/candidates
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/pileup_tensor_can
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/fa_tensor_can
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/vcf_output
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/tmp_vcf_output
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/logs/clair3_log
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/clair3_output/phased_output
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/clair3_output/vcf
gzip: /home/UID.gz: No such file or directory
gzip: name/Test/Test.gz: No such file or directory
gzip: data/demo.bed.gz: No such file or directory
[INFO] --include_all_ctgs not enabled, use chr{1..22,X,Y} and {1..22,X,Y} by default
[WARNING] No contig in --bed_fn was found in the reference, contigs in BED /home/UID name/Test/Test data/demo.bed:
This seems to be related to how the workflow decompresses the input files, particularly the input BED and genotyping VCF here and here.
This is particularly relevant when the space is within the user name, making it impossible to run the software.
Thanks
Andrea
VCFs are not generated for contigs where no variants are found, which breaks select_hetero_snp_for_phasing
.
clairs_output/vcf contains these files:
chr1.vcf chr16.vcf chr21.vcf chr22.vcf chr4.vcf chr5.vcf
Errors pasted from some of 1_select_hetero_snp_for_phasing.log
:
[INFO] Total HET SNP calls selected: chr22: 119, not found:27, not match:0, low_qual_count:0. Total normal:57 Total tumor:146, pro: 0.815068493150685
Traceback (most recent call last):
File "/opt/bin/clairs.py", line 120, in
main()
File "/opt/bin/clairs.py", line 114, in main
submodule.main()
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 157, in main
select_hetero_snp_for_phasing(args)
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 118, in select_hetero_snp_for_phasing
pro = len(pass_variant_dict) / len(tumor_qual_dict)
ZeroDivisionError: division by zero
Traceback (most recent call last):
File "/opt/bin/clairs.py", line 120, in
main()
File "/opt/bin/clairs.py", line 114, in main
submodule.main()
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 157, in main
select_hetero_snp_for_phasing(args)
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 118, in select_hetero_snp_for_phasing
pro = len(pass_variant_dict) / len(tumor_qual_dict)
ZeroDivisionError: division by zero
[INFO] Total HET SNP calls selected: chr1: 220, not found:66, not match:0, low_qual_count:0. Total normal:73 Total tumor:286, pro: 0.7692307692307693
[INFO] Total HET SNP calls selected: chr21: 1, not found:1, not match:0, low_qual_count:0. Total normal:31 Total tumor:2, pro: 0.5
Traceback (most recent call last):
File "/opt/bin/clairs.py", line 120, in
main()
File "/opt/bin/clairs.py", line 114, in main
submodule.main()
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 157, in main
select_hetero_snp_for_phasing(args)
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 118, in select_hetero_snp_for_phasing
pro = len(pass_variant_dict) / len(tumor_qual_dict)
ZeroDivisionError: division by zero
Hi,
I am trying to run ClairS with singularity, but it keeps looking for a model and an executable in my ${CONDA_PREFIX} . First I had to tweak a bit the command from the README, since -v is docker syntax, not singularity, the SIF file goes after the options and you are specifying it twice. This is my first command:
singularity exec \
-B ${TUM_INPUT}:${TUM_INPUT},${GL_INPUT}:${GL_INPUT},${REF_DIR}:${REF_DIR},${OUT}:${OUT} \
clairs_latest.sif \
/opt/bin/run_clairs \
--tumor_bam_fn ${TUM_INPUT}/test.sorted.bam \
--normal_bam_fn ${GL_INPUT}/test.gl.sorted.bam \
--ref_fn ${REF_DIR}/Homo_sapiens_assembly38.fasta \
--threads 32 \
--platform ont_r9 \
--output_dir ${OUT}
After which I get an error:
[ERROR] file ${CONDA_PREFIX}/bin/clairs_models/ont_r9 not found
I then unpacked the models in the folder where ClairS is looking for them but I still get an error:
[ERROR] Cannot find clair3 main entry in ${CONDA_PREFIX}/bin
So is the docker/singularity image self contained? I also tried installing through conda but the environment does not solve with incompatible specifications.
I am not an advanced user of singularity, so might be doing something wrong.
Hello,
I am working with HCC1395 data, analyzing tumor samples at 75x coverage and normal samples at 45x coverage. I utilized Clair3 to process the normal.bam file, generating a normal.vcf. This file was then employed for phasing and haplotagging the tumor.bam, followed by using a somatic mutation caller. The results showed a notable decrease in false positives.
phase and haplotag | Precision | Recall | F1-score | TP | FP | FN |
---|---|---|---|---|---|---|
ClairS germline.vcf | 67.12% | 77.64% | 72.00% | 30626 | 15001 | 8821 |
Clair3 normal.vcf | 72.50% | 77.46% | 74.90% | 30556 | 11593 | 8891 |
In an instance where false positives were converted to true negatives, it was observed that the mutations in the normal sample were heterozygous, whereas in the tumor sample, they were homozygous. This suggests a loss of heterozygosity (LOH) event, making the strategy of phasing and tagging most reads into the same haplotype seem correct. Have you considered this method?
Moreover, I noted in literature that the primary reason for choosing Longphase for phasing is its speed. We still have a speed advantage in haplotagging. ClairS employs parallel acceleration at the chromosome level and we can introduce a feature to specify a range. Could this reduce the training costs for you? I also conducted a haplotag test, and the results do not seem to show any significant differences.
haplotag | Precision | Recall | F1-score | TP | FP | FN |
---|---|---|---|---|---|---|
whatshap v1.7 | 67.12% | 77.64% | 72.00% | 30626 | 15001 | 8821 |
longphase v1.3 | 67.27% | 77.62% | 72.07% | 30617 | 14897 | 8830 |
Have you done any comparison with the latest Dragen Somatic https://www.biorxiv.org/content/10.1101/2023.03.23.534011v1.full.pdf ?
Dear ClairS Team,
While reviewing the code at the following link:
Line 348 in c464c98
find_candidate_match
function.Thank you for your time and support!
Best regards,
Hi, may I know if you're planning to enable somatic indel calls with PacBio HiFi? If not, would you be able to provide an instruction on how to train our own model for ClairS with the latest GIAB data from Revio? They're available here: https://downloads.pacbcloud.com/public/revio/2022Q4/
Thank you.
According to the ClairS documentation in the README, I expect the germline variants to be eliminated in the somatic vcf, and this is consistent with the demo outputs. However, running ClairS on my own data, I see overlaps between the clair3 germline callsets (both normal and tumor) and the final somatic callset. Is this behavior expected? If so, can you provide guidance on how to handle the overlapping calls?
Can you share the sh files for data preprocessing and model training? Like Clair3.
Thank you!
The tmp
folders created by ClairS are not deleted after the run is finished, or if they are, it is not consistent. They take up a significant amount of space and can unexpectedly fill up workspaces.
Similar to Clair3 (issue #40), is it possible to add an option to only call SNPs and ignore indels?
Thanks!
Hi, can you add the option to use the latest clair3 model for germline variant elimination? Thank you.
Dear Clair Team,
I am a member of the longphase development team. Recently, while using ClairS and studying related literature, I have observed some results that raised a few questions I'd like to inquire about.
During the use of ont_quick_demo.sh, I used IGV to observe the vcf in [Figure 1] process.After organizing the data, I noticed that only the tumor samples were heterozygous (as shown in Figure 2). My question is, during the Germline variant calling step, are only the tumor cells identified as confident heterozygous?
From Figure 2, I observed that the SNPs used for longphase actually include some somatic mutations. I would like to confirm if the SNPs used for phasing are indeed a mix of somatic mutations and Germline variants.
In Figure 3, the number of SNPs used for phasing seems insufficient to cover the entire range. Attempting other variant calling software, I found that Clair3 (v1.0.5) results could cover a broader range. After phasing and haplotagging with Clair3 and longphase (v1.6), I obtained the results as shown in Figure 4 (B) (only displaying SNPs that were phased). I noticed that different haplotypes could still be distinguished in ClairS's output (as seen in Figure 5). Have you considered using this method?
For ClairS, tagging appears to be a crucial step. The ideal output would include H1/H2 and H2 (carrying somatic mutations), or H1, H2, H3 (tumor-specific), etc. Would such an approach be beneficial for the training or detection of the ClairS model? If you think it would be helpful, we are considering developing a new version focused on somatic mutations.
Thank you for your time and assistance!
Hi ClairS team,
Do you have plan for providing document for training dataset generation? (like in Clair3)
Thanks!
This has already been resolved by I am sharing the email thread with @zhengzhenxian about this issue here.
ClairS returns slightly different results depending on the order in which bam files are being merged with samtools merge
. While samtools merge does return merged bams with dissimilar ordering, I do not think it should affect variant calling. Could you help me understand a). how these functionally identical bam files are causing differences in ClairS output, and b). how significant are these differences?
Reply: Thank you for providing us with the data. Based on the data you provided, we believe that the issue is due to some of the merged BAMs sharing the same read names(read name and count are shown in the figure). We used "samtools mpileup" to parse all input alignments, and samtools gave different pileup results for different merged ordering, resulting in different variant calling results in Clair3 and ClairS.
Unfortunately, we are unable to fix the issue as it falls under the logic of samtools. However, we have two suggestions that may be helpful. The first is to add a unique prefix for each BAM that needs to be merged. The second is to sort the BAM file names in a specific order, such as alphabetical order, before merging. In our testing, if all read names are unique, the calling gave the same result in multiple runs.
Based on your benchmarking data, nearly 50% of the read names have more than one read support, which has contributed to 34 different calls. We suggest implementing one of the suggestions above to ensure that the merged BAMs have unique read names and therefore produce consistent variant calling results.
Please let us know if you have any further questions or concerns.
Thanks for your quick response to my issue. My team and I are taking your advice to try to force deterministic ClairS outputs in our pipeline. Taking your first suggestion, my team has filtered out non-unique reads and verified with BamUtil diff that two separate bam files are functionally identical -- multiple reads mapped to the same position are still ordered differently. However, calling ClairS on these two bam files again returned different results. Also, I was not able to detect any differences between the pileup files from either the output temp files or my own mpileup outputs. Can you please advise us on the source of the nondeterminism? We do believe that sorting the bam files before merging would work and we are in the process of implementing it in our pipeline, but my team would like to understand the root cause so that we can properly assess the validity of our results.
Regarding the issue of nondeterminism, I would like to clarify that it stems from samtools itself. Even after removing duplicated reads, we still observe differences in the read disorder. To investigate this further, I examined the Clair3 results in ${OUTPUT_DIR}/tmp/clair3_output/clair3_tumor_output/merge_output.vcf.gz and compared them with the samtools mplieup result of the first different record (chr1:143185622). Unfortunately, I found that the output is still different due to the different read order(figure 1 and figure2).
The issue with the read order affects the ClairS VCF result because we encode the alignment into a spatial read-level representation in our full-alignment model. Different read orders lead to different input arrangements and thus giving different outputs for some variants.
To address this issue, I would still recommend fixing the BAM order in the merging step to avoid distinct read orders. Please feel free to let us know if you have any further questions.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.