
giab_data_indexes

This repository contains data indexes from NIST's Genome in a Bottle (GIAB) project. The indexes for sequences and alignments are also available at https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data_indexes.


AshkenazimTrio

Son:HG002     https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/
Father:HG003    https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG003_NA24149_father/
Mother:HG004     https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG004_NA24143_mother/

Sequencing Platform | Sequence | Alignment
Illumina WGS 2x150bp 300X per individual | All, HG002, HG003, HG004 | novoalign: All, HG002, HG003, HG004
Illumina 6KB Matepair | All, HG002, HG003, HG004 | bwamem:hg19 All, HG002, HG003, HG004
Illumina WGS 2x250bp | All, HG002, HG003, HG004 | isaac:hg19 All, HG002, HG003, HG004; novoalign: All, HG002, HG003, HG004
Moleculo | All, HG002, HG003, HG004 |
Illumina Whole Exome | - | bwamem:hg19 All, HG002, HG003, HG004
SOLiD 60x for son | All, HG002 | LifeScope:hg19 All, HG002
CompleteGenomics | - | CGAtools:hg19 All, HG002, HG003, HG004
Ion Proton 1000x Exome | - | TMAP:hg19 All, HG002, HG003, HG004
10X Genomics | - | bwamem:hg19 All, HG002, HG003, HG004
10X Genomics ChromiumGenome | All, HG002 | LongRanger2.0:hg19 All, HG002, HG003, HG004
BioNano | All:bnx, HG002:bnx, HG003:bnx, HG004:bnx | All:cmap, HG002, HG003, HG004
PacBio 70x/30x/30x | All, HG002, HG003, HG004; All:hdf5, HG002, HG003, HG004 | NGMLR:hg19 All, HG002, HG003, HG004; minimap2: All, HG002, HG003, HG004
PacBio CCS 10kb | All, HG002 | pbmm2:hg19 All, HG002
PacBio CCS 11kb | All, HG002 | pbmm2:hg19 All, HG002
PacBio CCS 15kb | All, HG002 | pbmm2:hg19 All, HG002
PacBio CCS 15kb_20kb chemistry2 | All, HG002 | pbmm2: All, HG002, HG003, HG004
Oxford Nanopore 2D | All, HG002 | -
Oxford Nanopore ultralong (guppy-V3.2.4_2020-01-22) | All, HG002 | minimap2:whatshap:hg19 All, HG002
Oxford Nanopore ultralong Promethion | All, HG002, HG003, HG004 | -
BGI BGISEQ500 | All, HG002 | -
BGI MGISEQ PCR-free | All, HG002 | -
BGI stLFR | All, HG002, HG003, HG004 | All:bwamem:hg19, HG002, HG003, HG004
Strand-Seq HG002 by BCCRC | All, HG002 | -

* CompleteGenomics LFR raw or alignment data are not available, but analysis results are available under https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/CompleteGenomics_newLFR_CGAtools_06122015/


ChineseTrio

Son:HG005     https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG005_NA24631_son/
Father:HG006     https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG006_NA24694-huCA017E_father/
Mother:HG007     https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG007_NA24695-hu38168_mother/

Sequencing Platform | Sequence | Alignment
Illumina WGS 2x250bp 300X for son; 2x150bp 100x for parents | All, HG005, HG006, HG007 | novoalign: All:hg19-hg38, HG005:hg19-hg38, HG006:hg19-hg38, HG007:hg19-hg38
Illumina 6KB Matepair | All, HG005, HG006, HG007 |
Moleculo | All, HG005, HG006, HG007 |
SOLiD 60x for son | All:xsq, HG005:xsq | LifeScope: All:hg19, HG005:hg19
CompleteGenomics | | CGAtools: All:hg19 (RMDNA), HG005:hg19, HG006:hg19, HG007:hg19; CGAtools: All:hg19 (cellsDNA), HG005:hg19
Illumina Whole Exome | | bwamem: All:hg19, HG005:hg19
Ion Proton 1000x Exome | | TMAP: All:hg19, HG005:hg19
BioNano for son | All:bnx, HG005:bnx | All:hg19 (cmap), HG005:hg19 (cmap)
PacBio Sequel for the trio | All, HG005, HG006, HG007 |
PacBio SequelII CCS 11kb | |
BGI BGISEQ500, MGISEQ, stLFR | |


NA12878

NA12878:HG001     https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/

Sequencing Platform | Sequence | Alignment
Illumina WGS 2x150bp 300X | HG001 | bwamem: HG001:hg19 (downsampled30x); novoalign: HG001
Illumina HiSeq Exome | HG001; HG001:trimmed_fastq | bwamem: HG001:hg19
Illumina TruSeq Exome | | bwamem: HG001:hg19
10X Genomics | | bwamem: HG001:hg19; bwamem: HG001:hg19 (size_selected)
10X Genomics ChromiumGenome | | LongRanger2.0: HG001:hg19-hg38; LongRanger2.1: HG001:hg19-hg38
CompleteGenomics | | CGAtools: HG001:hg19
Ion Proton 1000x Exome | | TMAP: HG001:hg19
NA12878 SOLiD5500W | | LifeScope: HG001:hg19
BGI BGISEQ500, MGISEQ, stLFR | |
PacBio 40x | HG001:hdf5 |
PacBio SequelII CCS 11kb | |
Ultralong_OxfordNanopore | - | minimap2: HG001



Please note:
1. If you want to use raw sequencing data (fastq, fasta, hdf5, xsq, bnx, etc.) for your analysis, use the sequence.index.* files to locate and download the data.
2. If you want to use aligned data (bam, xmap/cmap, etc.) for your analysis, use the alignment.index.* files to locate and download the data.
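For example, one way to fetch everything listed in an index file is to pull out the URL fields and pass them to wget. This is a minimal sketch, not an official tool; it assumes the index files are tab-delimited and simply keeps any field that looks like an FTP/HTTP(S) URL, so adjust it if the index you use is laid out differently.

#!/usr/bin/env bash
# Download every file referenced in a GIAB index file (sketch).
set -euo pipefail

INDEX=sequence.index.AJtrio_Illumina300X_wgs_07292015.HG002   # example index file
OUTDIR=giab_downloads
mkdir -p "$OUTDIR"

# Skip comment/header lines, then keep any whitespace-separated field
# that looks like an FTP or HTTP(S) URL and download it.
grep -v '^#' "$INDEX" \
  | tr '\t' '\n' \
  | grep -E '^(ftp|https?)://' \
  | while read -r url; do
      wget -c -P "$OUTDIR" "$url"    # -c resumes interrupted downloads
    done

The MD5 values listed on the same index lines can be used to verify the downloads afterwards (see the md5sum sketch under the checksum issue further down).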

giab_data_indexes's People

Contributors

chunlinxiao, jzook


giab_data_indexes's Issues

Data descriptions for newer data

Hi,

Data descriptions for some of the raw data would be greatly appreciated. Specifically, I am looking for information regarding

  1. Alignment method/any error correction used over raw subreads for alignments in ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/Baylor_NGMLR_bam_GRCh37/
  2. Description of sequencing method used for ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai/. Is it the same as that for HG002 (which is described in the original publication - https://www.nature.com/articles/sdata201625)?

Thanks!

trying to download chr22 subset with samtools

I am trying to get only chr22 reads from the NIST_NA12878_HG001_HiSeq_300x data to build material for a training course (the file has a bai index next to it and the command below runs). The 30x downsampled data present there is a bit too small for my purposes.

samtools view -b -h ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/HG001.hs37d5.300x.bam 22:0-50818468 > HG001.hs37d5.300x_chr22ss.bam

Each attempt gives a different file of 5.6 to 5.9 GB, which is too small to be the whole 300x chr22 subset (2% of 550 GB should be more like 11 GB). The records are OK and all from '22', but I fear they are only the first part of the real data. I suspect some timeout occurs here. I tried curl piped into samtools, but that fails because the bai index cannot be accessed.

Can someone confirm whether there is a problem with samtools on that link, or whether this is expected?
Is there any alternative to downloading the full 550 GB and subsetting locally?

Thanks
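One hedged workaround (not an official recipe) that avoids pulling the full 550 GB: download the .bai once, pass it to samtools explicitly with -X (available in recent samtools versions), and sanity-check the extracted region. Streaming over FTP can truncate silently, so repeating the extraction and comparing read counts is a cheap way to detect an incomplete transfer.

BAM_URL=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/HG001.hs37d5.300x.bam

# local, resumable copy of the index
wget -c "${BAM_URL}.bai"

# extract chr22 using the remote BAM but the local index
samtools view -b -X -o HG001.hs37d5.300x_chr22.bam \
  "$BAM_URL" HG001.hs37d5.300x.bam.bai 22

samtools quickcheck HG001.hs37d5.300x_chr22.bam   # checks the BAM EOF block is present
samtools flagstat HG001.hs37d5.300x_chr22.bam     # compare counts across repeated runs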

Locations of Oxford Nanopore fast5 data?

I see you now have PromethION data for several of the individuals, available in fastq format. However, I don't see a link to the raw fast5. Is this available somewhere?

MD5 checksum value error

I just downloaded the Illumina WES data from the Ashkenazi trio:
https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015

The file above contains some errors in the MD5 checksums shown in the GIAB GitHub table. Minor, but I thought it worth drawing to your attention in case you want to verify (and correct?) them.

On the GIAB github page:

  1. the bam & bai file for HG002 currently both show the same MD5 hash (= c80f0cab24bfaa504393457b8f7191fa).
    • In my download, that hash matches the .bam file, but the .bai file comes up as d4fea426c3e2e9a71bb92e6526b4df6f
  2. the bai file for HG004 shows MD5 hash = 8914bfb6fa6bd304192f2c9e13e903f.
    • In my download, the hash comes up as 8914bfb6fa6bd304192f2c9e13e903f4. Close enough to guess that the website text was probably inadvertently truncated by a digit, rather than an actual mismatch.

De novo mutations

Hi,
Where can I find the de novo mutations identified for both the Ashkenazi and Chinese trios?
thank you in advance :)

EDIT: I mean the 2,502 variants for the Ashkenazi trio and the 821 variants indicated in the paragraph "High Mendelian consistency in trios" of Wagner et al.

Truth_set information for benchmarking

Hi,
I'm currently benchmarking VCF files generated from HG002 data (a test run with just one sample) for SV calling (Manta, Lumpy, GRIDSS, nf-core/sarek) against a truth set. I aligned the BAM files to GRCh38. Which truth set should I benchmark my results against? In particular, can I use the truth sets from SV_0.6/ to benchmark the VCF files (aligned on GRCh38) generated by the SV callers? I am using Truvari and SVanalyzer for benchmarking.
Thank you.
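For what it's worth, a minimal Truvari invocation looks like the sketch below (Truvari 4.x assumed; the file names are placeholders, not the actual benchmark file names). The key constraint is that the calls and the benchmark must be on the same reference build; as far as I know the SV_0.6 integration is distributed on GRCh37, so GRCh38 calls would need a GRCh38 benchmark or a liftover.

# bgzip + index the caller VCF first
bgzip -f my_sv_calls.GRCh38.vcf
tabix -p vcf my_sv_calls.GRCh38.vcf.gz

# -b: truth VCF (bgzipped + indexed), -c: caller VCF,
# --includebed: confident regions from the same benchmark release
truvari bench \
  -b sv_benchmark.vcf.gz \
  -c my_sv_calls.GRCh38.vcf.gz \
  --includebed sv_benchmark_regions.bed \
  -o truvari_HG002_out/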

paired reads have different names

Hi, I am trying to run the alignment with bwa mem for the two files "U0a_CGATGT_L001_R1_001.fastq.gz" and "U0a_CGATGT_L001_R2_001.fastq.gz", which I downloaded from the FTP site, against the reference "GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz". The command I am using is
bwa mem -t 16 -R '@RG\tID:H814YADXX.5.CGATGT.1101\tSM:HG001\tPL:illumina' GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz U0a_CGATGT_L001_R1_001.fastq.gz U0a_CGATGT_L001_R2_001.fastq.gz | samtools view -b - >HG001.GRCh38_no_alt_analysis_set.bam

but I am getting an error with the sequence headers:

[mem_sam_pe] paired reads have different names: "HWACAGATTTTGT", "HWI-D00360:5:H814YADXX:1:1102:11719:83283"
[mem_sam_pe] paired reads have different names: "HWACTATTDDD", "HWI-D00360:5:H814YADXX:1:1102:11293:83492"
[mem_sam_pe] paired reads have different names: "@@faaa(+:A0&AA", "HWI-D00360:5:H814YADXX:1:1102:11730:83321"
[mem_sam_pe] paired reads have different names: "HWI-A@HWI-D00360:5:H814YX:1:1102:10399:83348", "HWI-00360:5:H814YADXX:1:1102:11699:83300"
[mem_sam_pe] paired reads have different names: "ACD00360TJJJJC@AGCCCTGCACCACCTAATAAGAACTGGAAAGTCEEDDDDDDDD", "HWI-D00360:5:H814YADXX:1:1102:11719:83361"
[mem_sam_pe] paired reads have different names: "HWCTAAAATC:BDDDDFDDDDDDCEDDDHJJEHIIIJJJHHH>HFFEEEEET:83ACDDDDTAAATTEDDDDDDEDDDDJJFHJJJJJJJJJJJJJJJJJJJJJJIJJJJJJ@T4BJJJTTATCTTG>FGGCAGGCTJJIJJJJJEDEECDDFAAGTAAADDDDDDDCTCTTCTTGTTTTCCCC>AGCC60:5:HC814YJDDDCCDDIGCCCTTC1IIIIHIEDDD@FFFCTTC1IIIIHIEDCCC;>CC60:5:H:0:CGADXX:1:1ATGTTTA:N:0:CGAC>CGAC>CG3AGGCTGAGGYADXX:JJJJJJJJIJJA0360GAIAGEEDEEEEC:GJIIJJJC:0:CGATGIFFFHHHHHJJJJJDEDDDDDGDEDDDDGTTTTTAT@HHJJJTGT", "HWI-D00360:5:H814YADXX:1:1102:11549:83491"
[mem_sam_pe] paired reads have different names: "HWCATCCTCCCAAGACTAADD@FFFC99:833C99:833CGCTTTGFHH@FFFFDDDCCCDCFB:>CA8>A??CC:A:ACTTACTCAAAAAACTATH814CAAATGCAGDDD:TTAAGTTCACAGCGA8DEDDDDDGJJJJJJDDDDDBDDDDDDDDDDDDDTGGACTTTJJHHHF60:5:HHH@FFFGTGGCAGGCTCCTGTAACGDDDDDDDDATGAACTCIACTAGDDDBBDDG9ATGGAATTTGACTTGADXX:1CACCTGCCAAACATACCCGTCTTTACC(G36CAGACCACCTGGACTTCCAGGEECDCDCDGAGGCCTGGCCATGTTATATGAAGTGIDXX:1CACCTGCCAAACATACCCGT", "HWI-D00360:5:H814YADXX:1:1102:11746:83407"
[mem_sam_pe] paired reads have different names: "HWACTATTDEFFFHHHHCCTTGTGTE:@DDDD49?IJJIGIG83407", "HWI-D00360:5:H814YADXX:1:1102:11545:83354"

I have tried sorting the two files using fastq-sort but still get the same error. Can anyone help?
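The garbled names in the error output (quality strings appearing where read IDs should be) usually point to corrupted or truncated local files rather than a problem with bwa itself. A minimal sanity-check sketch for the pair (file names taken from the command above; for older /1- and /2-suffixed read names, strip the suffix before comparing):

R1=U0a_CGATGT_L001_R1_001.fastq.gz
R2=U0a_CGATGT_L001_R2_001.fastq.gz

gzip -t "$R1" && gzip -t "$R2"      # any error here means a truncated/corrupted download
md5sum "$R1" "$R2"                  # compare against the values in the sequence.index file

# Read counts must be identical for a proper pair
echo "$(zcat "$R1" | wc -l) $(zcat "$R2" | wc -l)"

# Compare read names record by record (field 1 drops the mate comment)
paste <(zcat "$R1" | awk 'NR % 4 == 1 {print $1}') \
      <(zcat "$R2" | awk 'NR % 4 == 1 {print $1}') \
  | awk '$1 != $2 {print; bad=1} END {exit bad}' \
  && echo "read names match" || echo "read name mismatch detected"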

Strong coverage deviation for 1 of 13 subdirectories of NA12878 Illumina 300x WGS

Dear GIAB team,
while doing some k-mer counting using the files indexed at sequence.index.NA12878_Illumina300X_wgs_09252015, I noticed that the total number of 25-mers in all *.fastq.gz files in 140115_D00360_0010_BH894YADXX/ (hereinafter referred to as subdirectory 0010) differs significantly from all other subdirectories at NIST_NA12878_HG001_HiSeq_300x/ (005 to 009, 0011 to 0017).

Subdirectory 0010 contains only 4,185,958,248 (N-free) 25-mers, whereas all other 12 subdirectories (005 to 009, 0011 to 0017) contain between 56,877,996,538 and 69,240,304,680 25-mers each. This difference can also be seen in the number of files and the sum of the file sizes.
KMC3 outputs the same numbers of total k-mers per subfolder.

How does the low number of 25-mers in subdirectory 0010 fit with the quote "The other folders each contain ~20-30x sequencing total (a single flow cell)" in the README file?

Are you aware of this clear deviation for subdirectory 0010?
Have you discussed the possible causes of this outlier subdirectory in any of your publications, which I may have missed?
Can you rule out that this strong deviation for 0010 could possibly have negative effects on the whole data set?

Thanks in advance,
Jens
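As a quick cross-check that does not depend on any particular k-mer counter, one can sum the sequenced bases per flow-cell subdirectory of a local mirror; a rough sketch (run from the directory holding the subdirectories, using ~3.1 Gbp as the haploid genome size; adjust the find pattern if your layout differs):

for d in */ ; do
  bases=$(find "$d" -name '*.fastq.gz' -print0 \
            | xargs -0 -r zcat \
            | awk 'NR % 4 == 2 {n += length($0)} END {print n+0}')
  echo -e "${d%/}\t${bases} bases (~$((bases / 3100000000))x depth)"
done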

SRX accessions

Dear GIAB team

Thanks for gathering all the links here.
It would be super helpful if the SRX accessions were also provided here.

Thank you in advance.
Sina

HG001 - Incorrect directory name for one of HiSeq 300X libraries

It seems the directories are named based on the sample libraries.
For example, all FASTQs in dir 140127_D00360_0011_AHGV6ADXX are from library H8GV6ADXX.

https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140127_D00360_0011_AHGV6ADXX/

Minor thing but it looks like this dir has been incorrectly named, perhaps due to a typo!
It should be 140127_D00360_0011_AH8GV6ADXX not 140127_D00360_0011_AHGV6ADXX. Rest of the library directories for HG001 are all named consistently based on the library.

PromethION data de novo assembly?

Hi, in the README for the ONT-PromethION datasets (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/UCSC_Ultralong_OxfordNanopore_Promethion/) under 'Data Processing Methods', alignment of the called reads is mentioned. But in the linked paper (https://doi.org/10.1101/715722) it's explained that the dataset was assembled de novo, only doing alignment afterward for benchmarking (unless I'm misunderstanding).

Also in the README under the 'Data Processing Methods'-header, a newer version of Guppy is mentioned than the one used in the paper, which suggests to me that that part was added more recently, and it contains no information on assembly at all. Does that mean that newer versions of the data are no longer generated de novo?

filedate header in recent benchmark .vcf files is from 2016

Hi!
I'm trying to use the HG002 benchmark .vcf as a truth set for my variant calling work (after seeing it mentioned in https://doi.org/10.1038/s41587-020-0538-8). When I downloaded the latest version from https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/ I noticed the fileDate in the header was set to '20160824', despite reading about versions from 2020/2021. Am I missing something obvious here or is the header incorrect?

Passage of GIAB samples

Hi,

I'm currently working with FASTQ files of the HG002 and HG005 cell lines, sequenced at 300x on the Illumina platform and downloaded from the GIAB FTP site.

I would like to know the passage number of these cell lines. My research involves analyzing mitochondrial variants in these samples, so it is crucial to know whether they are from early or late passages. If there is any resource or information available regarding the passage number of these cell lines, could you please share it?

Thank you!

multiple primary records for same read group RMNISTHS_30xdownsample.bam

Hello,

I'm trying to test out a variant calling pipeline using the GIAB BAM file downloaded from the ftp server (/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/RMNISTHS_30xdownsample.bam), but am getting the following error:

Fatal error: Assertion failed in ../src/host/dragen_api/bam2dbam_transformer.cpp line 445 -- false -- There are multiple input primary records for read HWI-D00360:5:H814YADXX:2:2215:17273:66909, in the same read group. This is a violation of the BAM standard, which indicates that if two records have matching QNAME, they should be construed as deriving from the same template. Perhaps there was an error in setting up the read groups during BAM creation.

I saw a previous issue for a separate file where bams were merged improperly. Could that be happening here? Thanks!

Nate
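A quick way to test for this locally (a sketch): among primary alignments (-F 0x900 excludes secondary and supplementary records), each read name of a proper pair should occur at most twice, so any name seen more often indicates duplicated primary records. The whole-genome sort is slow, so restricting to one chromosome first is advisable (this assumes the accompanying .bai index is present; drop the region to scan the whole file).

samtools view -F 0x900 RMNISTHS_30xdownsample.bam 20 \
  | cut -f1 | sort | uniq -c \
  | awk '$1 > 2 {print; found=1} END {exit found}' \
  && echo "no read name occurs more than twice among primary records" \
  || echo "duplicated primary records present"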

HG001_bam files information

Hello, I'm attempting to convert the HG001 BAM files to FASTQ format for benchmarking purposes. However, I'm facing challenges aligning the generated FASTQ files to my reference genome. Could you provide information on the reference genome used to create these BAM files?

ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/RMNISTHS_30xdownsample.bam
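In the meantime, the BAM header itself usually identifies the reference: the @SQ lines list the contig names and lengths, and the @PG lines often record the exact aligner command line, including the FASTA path. A quick sketch:

samtools view -H RMNISTHS_30xdownsample.bam | grep -E '^@(SQ|PG)' | head -n 30
# hs37d5/GRCh37-style headers list contigs 1..22, X, Y, MT plus decoys such as hs37d5;
# GRCh38-style headers use chr-prefixed names (chr1, chr2, ...).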

Incorrect link to PacBio CLR data for HG004

This is a minor issue. It looks like the link listed in the main table for the PacBio CLR data of HG004 points to the wrong location.

The link points to: https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_subreads_fasta_10082018.HG004

and this file is not present. There is a link in the manifest that appears to be correct for HG004 at the following location:

https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_subreads_fasta_10052018.HG004

It may be that this link just needs to be updated.

MD5 checksums don't match for `HG002 Illumina 2x150bp`

I've tried downloading a few FASTQ files listed in sequence.index.AJtrio_Illumina300X_wgs_07292015.HG002 and found that the MD5 checksums listed there don't match the downloaded files.

  • No download errors
  • Tried multiple times
  • Verified random files
  • Checksums still don't match

commands used:

$ wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/Project_RM8391_RM8392/Sample_2A1/2A1_CGATGT_L001_R1_001.fastq.gz

$ md5sum 2A1_CGATGT_L001_R1_001.fastq.gz
c2ae5e412fb211974f9a9a46a5392428  2A1_CGATGT_L001_R1_001.fastq.gz

MD5 checksum listed for the same file from the same library is 48e52acfce7548bddad2b3f89e8e0348

ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/Project_RM8391_RM8392/Sample_2A1/2A1_CGATGT_L001_R1_001.fastq.gz 48e52acfce7548bddad2b3f89e8e0348 ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/Project_RM8391_RM8392/Sample_2A1/2A1_CGATGT_L001_R2_001.fastq.gz bd37bc5dedb31845361f531803ee03b5 HG002

Can you please verify this?

Best,
Faizal
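For reference, a sketch for checking local downloads against an index file. The column layout (R1 URL, R1 MD5, R2 URL, R2 MD5, sample) is inferred from the example line above, so adjust the awk if the index you use differs; run it in the directory holding the downloaded files.

INDEX=sequence.index.AJtrio_Illumina300X_wgs_07292015.HG002

grep -v '^#' "$INDEX" \
  | awk -v OFS='  ' '
      NF >= 2 { n = split($1, a, "/"); print $2, a[n] }   # R1: md5  filename
      NF >= 4 { n = split($3, b, "/"); print $4, b[n] }   # R2: md5  filename
    ' > expected.md5

md5sum -c expected.md5    # reports OK / FAILED per file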

HG002 2x250 BAMs are double-covered by identical reads

The following BAM file for HG002:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002.hs37d5.2x250.bam

...seems to erroneously contain two copies of every read pair. For instance, a simple view of the BAM shows:

D00360:97:H2YVMBCXX:2:1107:18923:87587  163     1       10114   6       42M2S   =       10407   337     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCTT    DDDDDIIIIIIIIIIIIIHIIIIIHHIHIIIIIIII=<G?HI11    PG:Z:novoalign  AS:i:21 UQ:i:21 NM:i:0  MD:Z:42 PQ:i:22 SM:i:0  AM:i:0
D00360:97:H2YVMBCXX:2:1107:18923:87587  163     1       10114   6       42M2S   =       10407   337     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCTT    DDDDDIIIIIIIIIIIIIHIIIIIHHIHIIIIIIII=<G?HI11    PG:Z:novoalign  AS:i:21 UQ:i:21 NM:i:0  MD:Z:42 PQ:i:22 SM:i:0  AM:i:0
D00

...and so on for every read.

PacBio data download failed

Hello,
I want to download the PacBio BAM files, for example:
wget ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/CSHL_bwamem_bam_GRCh37/BWA-MEM_Chr10_HG002_merged_11_12.sort.bam
but the download failed.
Could you tell me the correct download link?

Add ID column to SV BED file

Hi,

I constantly make use of the GIAB SV callset and really appreciate the effort of curating all of these.

I do have one feature request:

The SV BED file currently contains only the coordinates, not the type of variant each interval is associated with or the originating variant ID from the VCF (on hg19).
An IGV trick I constantly use is to pack information from the source VCF that I want to see at a glance into the ID (4th) column of the BED file, which IGV then displays; this way one doesn't need to click through to a VCF record just for a quick look.

I'd appreciate it if the VCF ID records could be copied into the BED file.

Thank you!
Steve
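In the meantime, a version of this can be generated locally with bcftools query. A sketch, assuming a recent bcftools with %POS0 support and that the release VCF carries INFO/END and INFO/SVTYPE annotations; the file names follow the v0.6 release naming but should be verified:

bcftools query \
  -f '%CHROM\t%POS0\t%INFO/END\t%ID;%INFO/SVTYPE\n' \
  HG002_SVs_Tier1_v0.6.vcf.gz \
  > HG002_SVs_Tier1_v0.6.withIDs.bed
# column 4 (ID;SVTYPE) is what IGV displays for each interval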

Bam index files for HG002 not working?

Hello. I'm trying to download a subset of data from HG002 and parents. I'm using the command samtools view -bh -o HG002_20.bam ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002.hs37d5.2x250.bam 20, which should save chr 20 in a bam file for me. However, I get the following error:
[E::idx_test_and_fetch] Error reading "ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002.hs37d5.2x250.bam.bai" [1] 6864 segmentation fault (core dumped) samtools view -bh -o HG002_20.bam 20

I have the same issue when I try to download the reads aligned to GRCh38.

Note that the above command works fine for the mother and father's reads.

Could it be that when the BAM files were re-uploaded in 2019, they were not re-indexed?

I've tried this with samtools 1.10 and samtools 1.9 and both give errors.
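One way to narrow this down (a sketch): fetch the .bai directly and check that it is a valid BAI file before blaming samtools; if the index downloads cleanly and looks valid, it can be passed to samtools explicitly with -X, as in the sketch under the HG001 300x chr22 issue above.

wget -O HG002.hs37d5.2x250.bam.bai \
  ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002.hs37d5.2x250.bam.bai

# the first four bytes of a valid BAI index are 42 41 49 01 ("BAI\1")
xxd -l 4 HG002.hs37d5.2x250.bam.bai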

raw data between 30X and 40X and their truth VCF

Dear GIAB team,

I just discovered these wonderful data (singleton and trio) to evaluate my variant call pipeline.

This may be a naive request, but would it be possible to point me to a raw data link (Illumina FASTQ) with coverage between 30x and 40x, together with the corresponding truth VCF on hg38? My human WGS samples have coverage between 30x and 40x, which is why I am asking for raw data at this coverage.
I also wanted to know whether using the hg38.fasta reference from UCSC would affect the comparison against the truth VCF.

I would be very grateful. Thanks in advance!

NA12878 TruSeq references

Hello!
I tried to compute HsMetrics for the NA12878 TruSeq BAM and could not find any information about the FASTA reference used for the alignment.
I searched the GIAB website, but it seems the information was updated or deleted. The FAQ as well as the FTP README mention newer versions of GRCh37/hg19 that are not suitable for making an interval_list (in my case).
The BAM file is NIST-hg001-7001-ready.bam and the targets are TruSeq_exome_targeted_regions.hg19.bed; it seems they use different chromosome namings?

Raw PacBio subreads data

Dear GIAB team,

Thanks for the wonderful data collection.
This may be an inappropriate question, but would it be possible to access the raw PacBio SequelII CCS 11kb data of NA12878 (or other samples), i.e., the subreads.bam with IPD and other information in it?

Thank you in advance!

Best,
Peng

Request for Information on Original VCF for Lifted Over VCF

Hi GIAB folks!

I am currently working on a project and I have been using the NIST_SVs_Integration_v0.6 truth set (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/).

In order to ensure that I am using the correct GRCh38 version of this truth set, I have been searching for information on the original VCF used to produce the lifted-over VCF file available at https://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/vcf/nstd175.GRCh38. Unfortunately, I was not able to find enough information on this in the README.md.

I would greatly appreciate it if you could provide any information you may have on the original VCF used to produce the lifted-over VCF file.

Thanks,
Gao
