na12878's Introduction

Oxford Nanopore Human Reference Datasets

Quick Links

Data availability

Notes on downloading files.

Files are generously hosted by Amazon Web Services. Although they are available as straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface. References should be amended to use the s3:// addressing scheme, i.e. replace http://s3.amazonaws.com/nanopore-human-wgs/ with s3://nanopore-human-wgs/ to download. For example, to download rel3-nanopore-wgs-288418386-FAB39088 to the current working directory, use the following command:

aws s3 cp s3://nanopore-human-wgs/rel3-nanopore-wgs-288418386-FAB39088.fastq.gz .

Amending AWS CLI settings such as max_concurrent_requests, as per this guide, will improve download performance further.
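The HTTP-to-s3:// rewrite described in these notes can be applied to a whole list of links at once. The following is a minimal sketch, assuming the links are path-style URLs whose path begins with the bucket name; the function name `http_to_s3` is illustrative and not part of any tool used by the project.

```python
from urllib.parse import urlparse

BUCKET = "nanopore-human-wgs"  # bucket name taken from the notes above


def http_to_s3(url: str, bucket: str = BUCKET) -> str:
    """Rewrite a path-style HTTP link into the s3:// form used by `aws s3 cp`.

    Only the URL path is inspected, so the host name is not validated.
    """
    path = urlparse(url).path
    prefix = f"/{bucket}/"
    if not path.startswith(prefix):
        raise ValueError(f"URL does not point into bucket {bucket!r}: {url}")
    key = path[len(prefix):]
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    link = ("http://s3.amazonaws.com/nanopore-human-wgs/"
            "rel3-nanopore-wgs-288418386-FAB39088.fastq.gz")
    print(http_to_s3(link))
```

The returned string can be passed directly as the source argument of `aws s3 cp`.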

Contact

Please raise issues concerning this dataset on this GitHub repository.

History

* rel1: 1st December 2016. Initial release.
* rel2: 5th December 2016. 25 flowcells, 58958035887 bases, 9053909 reads
* rel3: 39 flowcells, 91240120433 bases, 14183584 reads
* rel4: added additional 14 flowcells, 23140190547 bases, 1415868 reads
* rel5: June 2018. All data basecalled with Guppy 0.3 (10 kb chunk size)
* rel6: July 2019. All files converted to multi-fast5 format and basecalled with Guppy 2.3.8+HAC model.
* rel1 RNA: 30th November 2017. 30 flowcells (native RNA), 12 flowcells (1D cDNA)
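
The base and read counts listed for rel2 to rel4 allow a quick comparison of the releases. The sketch below derives the mean read length and the yield per flowcell from those figures alone. Whether the rel3 totals include the rel2 flowcells is not stated in the history, so each entry is simply treated on its own; the function and key names are illustrative.

```python
# (flowcells, bases, reads) copied from the release history above.
RELEASES = {
    "rel2": (25, 58958035887, 9053909),
    "rel3": (39, 91240120433, 14183584),
    "rel4 (additional flowcells)": (14, 23140190547, 1415868),
}


def summarize(flowcells: int, bases: int, reads: int) -> dict:
    """Derive simple per-release statistics from the published totals."""
    return {
        "mean_read_length": bases / reads,
        "gigabases_per_flowcell": bases / flowcells / 1e9,
    }


if __name__ == "__main__":
    for name, (flowcells, bases, reads) in RELEASES.items():
        stats = summarize(flowcells, bases, reads)
        print(f"{name}: mean read length {stats['mean_read_length']:.0f} bases, "
              f"{stats['gigabases_per_flowcell']:.2f} Gb per flowcell")
```

On these figures the mean read length is roughly 6.5 kb for rel2 and rel3 and roughly 16 kb for the flowcells added in rel4.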

na12878's People

Contributors

alexomics, hasindu2008, mattloose, mitenjain, nickloman, paultsw, tomsasani


na12878's Issues

Downloading human fast5 files from direct RNA data?

Hello,

I am totally new to the nanopore sequencing field. Could you please direct me to where I can download the raw fast5 files for the direct RNA sequencing data of the GM12878 human cell line? I am interested in looking at the ionic current profiles of human mRNAs.
Also, is there a tutorial that you could point me to for getting started?

Thank you for your help!
Aashiq

What is the accuracy of canu.35x.contigs.polished2.fasta?

Hi,
first of all: very impressive job, congratulations!
My question is whether you have also tried to estimate the accuracy of canu.35x.contigs.polished2.fasta.
I guess it must be close to the 99.88% you report for the 30X assembly?

By the way, can you confirm which polishing steps were applied to the 30X and 35X assemblies?
Table 1 says Pilon 2x, and indeed the contigs are named tigXXXXXXXX_pilon_pilon, but the section "Assembly of the dataset" says: "We performed a de novo assembly of the 30× dataset with Canu and polished the assembly using both nanopore signal and Illumina data (Table 1)." Did you also use Nanopolish on the 30X/35X assemblies? I guess you only used it for the chr20 assemblies?

Thank you!
Fran

HMM from MarginAlign?

Would it be possible to include a model file produced by MarginAlign for subsequent realignment? Perhaps one for LAST and BWA?

FAST5 Directory Names

I have started to download some of the FAST5 tarballs and there appears to be an issue with the directory structure within each of the tarballs. Each of the tarballs I have untar'ed thus far creates the following directory structure "./rel3/${chr}/${fn}" (literal file name; those are not placeholder curly brackets). This is not a huge deal, but it does involve unzipping the files within separate directories to avoid incredibly large numbers of files within the directory containing all FAST5 files. I know this is a lot of work to re-upload, but thought it worth mentioning.

NA12878.hq.sv.vcf

Could you describe the included vcf file NA12878.hq.sv.vcf? Is this a lumpy SV callset using the ONT sequencing data? I'm not clear how this file corresponds to the manuscript. I'm interested in the SV callset from nanopore data that was described in the paper, and wonder if that can be made available.

Thank you,
Ted

Basecalling

Hi,
I was wondering which base caller you used for the reads. I heard that there is an improved one (Scrappie, or similar) that works better on homopolymers. Is there a plan to redo the base calling for the whole genome?
Thanks
Fritz

Nanopore transcriptome reads on UCSC genome browser

Hello.

I am currently using cDNA sequencing data, and when I tried to upload the reads to the UCSC genome browser, I found that the strand information of my reads is shown the opposite way round (I am very positive that all of the sequences are shown in the opposite direction). I am wondering whether any specific options or steps are needed to process the data.

I generated my data straight from the MinION and mapped it using minimap2 without any filtering. So I am wondering whether there are any filtering steps you perform that are necessary for nanopore sequencing analysis. I have been doing some research as well, but I am also hoping to learn which methods you use.

Thank you for your time and generous help.

Unique ID Correspondence between reads in the BAM file and the fast5 files

Hi, I am trying to match individual reads in the BAM files to their corresponding fast5 files.

To do this I am looking up the QNAME of each read and checking whether it matches an attribute in the fast5 file. My current hypothesis is that the QNAMEs were taken from the run_id attribute in 'fast5file/UniqueGlobalKey/tracking_id/'.

As an example, a typical QNAME looks like this:

4868307d-8e09-4ab0-8df2-eaccbed96e73_Basecall_Alignment_template

while a typical run_id looks like this:

a4429838-103c-497f-a824-7dffa72cfd81

Could someone please confirm whether my hypothesis is true? If not, what do QNAMEs in the BAM files correspond to, and what was the procedure for assigning QNAMEs to reads in the alignment files?

My end goal here is to map aligned reads to the fast5 files which they correspond to.

Note: the example id's above were both taken from chrm1, but of course not necessarily from the same overlapping region.
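
One way to test the hypothesis above is to reduce each QNAME to its leading UUID and look it up among identifiers extracted from the fast5 files. The sketch below assumes only what the example shows, namely that a QNAME starts with a UUID followed by a suffix such as `_Basecall_Alignment_template`. Which fast5 attribute actually carries the matching identifier is not confirmed in this thread, so the `fast5_ids` dictionary (identifier to file path) is a hypothetical input that the caller must build from whichever attribute is being tested.

```python
import re

# A UUID at the very start of the QNAME, as in the example above.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def qname_to_uuid(qname: str):
    """Return the leading UUID of a QNAME in lower case, or None if absent."""
    match = UUID_RE.match(qname.lower())
    return match.group(0) if match else None


def match_reads(qnames, fast5_ids):
    """Map each QNAME to a fast5 path when its UUID is a key of fast5_ids.

    fast5_ids: dict of identifier -> fast5 path (hypothetical input, built
    by the caller from the fast5 attribute under test).
    """
    hits = {}
    for qname in qnames:
        key = qname_to_uuid(qname)
        if key is not None and key in fast5_ids:
            hits[qname] = fast5_ids[key]
    return hits
```

If no QNAME matches any value of the attribute under test, that attribute is not the source of the QNAMEs and another one should be tried.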

RNAtotal?

Have you tried the direct RNA protocol with total RNA as input?
If so, how much total RNA did you use as input?

Figure 7 accuracy

Dear all,
first of all, congratulations on your work developing such a nice nanopore technology and providing the community with tools to work with the data. Thank you very much.
I would like to know how you obtained the accuracy plot in Figure 7 of your paper "Nanopore sequencing and assembly of a human genome with ultra-long reads", since I would like to make a similar comparison between my unpolished and Pilon-polished nanopore data. However, I have not been able to work out which data should be used to produce such a plot.
Any advice will be appreciated.
Thank you in advance.

Can researchers download and use NA12878 RNA nanopore datasets for their studies?

Hi,

Our lab is trying to use short reads and long reads to identify non-colinear RNAs. The long-read dataset can help us to understand the lengths of non-colinear RNAs. I would like to ask whether the NA12878 RNA nanopore datasets are open for researchers to use in their studies.

Sincerely,

Chia-Ying Chen

postdoctoral fellow
Comparative and Evolutionary Genomics Lab
Genomics Research Center, Academia Sinica, Taiwan
TEL: +886-2787-1248

rel4 fastq files

There are links to the rel3 fastq files on the GitHub site, but I don't see the rel4 fastq files. Would it be possible to add these links as well?

Thanks!
Justin

Read to Flowcell mapping

With the combined, updated Guppy calls of the full dataset, is there a way to separate the reads by flowcell? I'd like to use actual per-flowcell datasets to evaluate the nanopore data. The fastq file does contain "runid" values, but when I tried to separate the rel5 Guppy fastq file by runid, I found more than one thousand different runid values. I also can't find any place where the flowcell ID is shown in these calls.

What is the best way to access the per-flowcell basecalls for the data (is there something better than downloading the Albacore calls, building the mapping of readId to flowcell, then splitting the Guppy calls that way)?
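
The fallback described in the question (build a read-to-flowcell table from the per-flowcell Albacore calls, then split the Guppy calls with it) can be sketched as follows. This assumes that the first whitespace-delimited token of each FASTQ header is a read identifier shared by both sets of calls, which is not confirmed here, and that records are not line-wrapped. All function names are illustrative.

```python
from collections import defaultdict


def parse_fastq(lines):
    """Yield (header, sequence, separator, quality) from unwrapped FASTQ lines.

    Blank lines are skipped; a truncated final record is silently dropped.
    """
    stripped = (line.rstrip("\n") for line in lines if line.strip())
    # zip over the same iterator four times groups the lines in fours.
    yield from zip(stripped, stripped, stripped, stripped)


def read_id(header):
    """First token of the header line, without the leading '@'."""
    return header[1:].split()[0]


def build_mapping(per_flowcell_lines):
    """per_flowcell_lines: dict of flowcell ID -> iterable of FASTQ lines."""
    mapping = {}
    for flowcell, lines in per_flowcell_lines.items():
        for header, _seq, _sep, _qual in parse_fastq(lines):
            mapping[read_id(header)] = flowcell
    return mapping


def split_by_flowcell(records, mapping, unknown="unassigned"):
    """Group FASTQ records by flowcell; unmapped reads go to `unknown`."""
    groups = defaultdict(list)
    for record in records:
        groups[mapping.get(read_id(record[0]), unknown)].append(record)
    return dict(groups)
```

For the full dataset the groups would be streamed to one output file per flowcell rather than held in memory; the in-memory lists only keep the sketch short.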

RNA read lengths

Hi,

Thank you for this great resource! I was playing around with the data and trying to get an idea about the alignment properties etc. In particular I was looking at this alignment file: http://s3.amazonaws.com/nanopore-human-wgs/rna/bamFiles/NA12878-DirectRNA.pass.dedup.NoU.fastq.hg38.minimap2.sorted.bam

Correct me if I'm wrong, but the left plot on Slide 17 (on read lengths) should correspond to this dataset? However, using pysam to parse the dataset, I was only able to see ~50,000 reads longer than 5,000 bases. Am I mixing the cDNA and RNA datasets somehow? Will be grateful for further guidance on this. Thank you.
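
When comparing a count like this against a published read-length plot, it can help to separate alignment records from distinct reads, because a BAM file may hold secondary and supplementary records for the same read, and records without a stored sequence report a length of 0. The sketch below is one possible way to do the count; it works on any objects exposing the pysam `AlignedSegment` attributes `query_name`, `query_length`, `is_secondary` and `is_supplementary`. The local file name in the usage section is an assumption, and whether the slide was produced from this file at all is not established here.

```python
def long_read_counts(alignments, min_length=5000):
    """Count records and distinct reads whose stored sequence exceeds min_length.

    query_length is the length of the sequence stored in the record, so it
    excludes hard-clipped bases and is 0 when no sequence is stored.
    """
    records = 0
    primary = 0
    names = set()
    for aln in alignments:
        if aln.query_length <= min_length:
            continue
        records += 1
        names.add(aln.query_name)
        if not (aln.is_secondary or aln.is_supplementary):
            primary += 1
    return {"records": records, "primary": primary, "distinct_reads": len(names)}


if __name__ == "__main__":
    import pysam  # third-party; only needed for this usage example

    # Assumed local copy of the BAM file named in the question.
    path = "NA12878-DirectRNA.pass.dedup.NoU.fastq.hg38.minimap2.sorted.bam"
    with pysam.AlignmentFile(path, "rb") as bam:
        print(long_read_counts(bam.fetch(until_eof=True)))
```

Unaligned reads are absent from an alignment-only BAM file, so a length distribution drawn from the FASTQ files can legitimately differ from one drawn from the alignments.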

unlisted flow cells

We downloaded part of this dataset (I think chr 20 only), and I am seeing some flow cell names which don't appear in the README: FAB38968, FAB42483, and FAB46682. (I'm also missing FAB42483 entirely.) What are these, and why aren't they listed in the README? Thanks!

Packing bug in chromosome Y

I've downloaded the chromosome Y reads, part 1 (the only part), from the readme.

After untarring I have a folder named rel3/${chr}/${fn}, which smells like unresolved environment variables, i.e. a bug in the script that generated the archive. Can I still trust the data obtained?

throughput variability

Awesome!

I'd be very interested in any observations on what's contributing to the GB variability. Looking forward to more information as the project matures. New to the game, but quite a few minIONs in proximity and a promethION is being patiently waited for.

Daryl

FAST5 missing

Hello @mitenjain ,

In order to use the nanopore reads in my research, I have been looking for the raw .FAST5 files of rel4 (the ultra read set) of the NA12878 cell line, but I have been unable to locate them.
Are they available somewhere else?

If the raw files are not available, are fastq files re-basecalled with the newest version of Guppy (v2.1.3) available?

Thank you,
Eleni

fast5's of some RNA reads corrupted?

Some of the fast5 files of the RNA reads do not seem to be readable. I downloaded the reads of the 5 Bham runs (using wget; I don't know if that matters). My Python scripts (using the h5py library) return "OSError: Unable to open file (file signature not found)". hdfview 2.13.0 also fails to open them (java.io.IOException: Unsupported fileformat). Re-downloading them doesn't solve the issue, and many other fast5 files from the same set read just fine. I've attached a list of some reads for which this was the case, but there are more (the quick-and-dirty script that encountered them just wrote error messages to the screen, from where I copy-pasted this list). I'll try to get a complete list of all the reads that seem unreadable.

Are these actually corrupted or am I doing something wrong?
na12878_rnaReads_failed.txt
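
A complete list of unreadable files can be gathered without opening each one through h5py. An HDF5 file begins with the 8-byte signature `\x89HDF\r\n\x1a\n`, and the "file signature not found" error means that signature was not located. The sketch below checks offset 0 only, which assumes the fast5 files have no user block in front of the superblock, and records the size and first bytes of every suspect file so that empty files and non-HDF5 payloads can be told apart. The function names are illustrative.

```python
import os

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path):
    """True if the file starts with the HDF5 signature (offset 0 only)."""
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def triage(paths):
    """Split paths into readable candidates and suspects.

    Returns (ok, suspect), where suspect holds (path, size, first_bytes).
    """
    ok, suspect = [], []
    for path in paths:
        if has_hdf5_signature(path):
            ok.append(path)
        else:
            with open(path, "rb") as handle:
                head = handle.read(16)
            suspect.append((path, os.path.getsize(path), head))
    return ok, suspect
```

A suspect with size 0 points to a truncated or failed write, while readable text in the first bytes suggests that something other than the fast5 payload was saved under that name.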

availability of rel4 fast5 files

Hi all,

I found that the fastq files of rel4 are available, but the fast5 files do not seem to be. Would it be possible to make the rel4 fast5 files available?

Many thanks,
Yifan

PolyA signals identification

It is not clear to me how the signals corresponding to poly(A) were identified in the raw signal data.
Could you explain this a little?

Thanks very much.

Huanlee

Is there an order in the fast5 zipped files in each chromosome?

Hello,
The readme file states "Each complete 'part' contains 100,000 reads and should be roughly in sort order along the chromosome..."
Does that mean that the reads in each part are sorted, and that each part contains reads aligning to the whole chromosome?

For example, there are 9 parts of zipped fast5 files for chr1.
Do the files in part 1 align to one end of the chromosome and the files in part 9 to the other end?
Or do all parts contain reads from all over the chromosome, with the reads at the beginning of each part aligned to the start of the chromosome and the reads at the end aligned to the end of the chromosome?

Thanks.

fatal error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)

Hi all,
When I run aws s3 cp s3://nanopore-human-wgs/rel3-nanopore-wgs-288418386-FAB39088.fastq.gz . (the provided example), I get a fatal error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)

When I pass the flag --no-verify-ssl, I get a different fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

I have already configured aws using "aws configure".
I am using a Windows machine and running aws from the Command Prompt.

Best,
Doruk

Estimation of poly(A) tail sizes

It's not clear to me how poly(A) tail sizes were estimated. Will you be updating this page with a list of the scripts used for the analyses? It would be nice to try to reproduce some of the figures.

file size

Is it possible to see the size of each separate dataset?

How can the dRNA and cDNA data be downloaded?

I know the genomic DNA sequence is in the ENA, but I want to take a look at the RNA data.

http://s3.amazon.com/nanopore-human-wgs/ just brings up the AWS management console.

I've tried the command as per the example:
aws s3 cp s3://nanopore-human-wgs/rel3-nanopore-wgs-288418386-FAB39088.fastq.gz .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

Also tried:
aws s3 cp s3://nanopore-human-wgs/ .
fatal error: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
aws s3 cp s3://nanopore-human-wgs/* .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

Is there a listing of files? Is there some other way to access this data?

Minimap parameters

Hi there,

Is there any chance you could share the Minimap parameters you used for the ONT DRS reads?

Many thanks

Sample prep?

Any details on sample preparation (i.e. kit and flowcell versions?)

Wrong entry in Table at NA12878/RNA.md ?

Hi,
thanks for making all that great data available ahead of publishing. I was wondering whether the labelling of the 4th line in the first table (section Basecalls (Albacore 2.1)) in the file NA12878/RNA.md was correct. There are two rows (3rd and 4th row) with the FileType cDNA Pass. Is this correct? Also is the mean length of those correct?
Thanks for checking.
Kind regards,
Marcel

Availability of VCF file used

Hello!
I am particularly interested in knowing which variants on the transcripts were used to assign maternal or paternal haplotype. However the (phased) VCF file is not posted on the github and from the "online methods" it isn't completely clear how the variants were called and phased which makes it somewhat difficult to reproduce the results and find what I'm looking for...
Could the VCF used be posted on here as well? Or more clarification on how it was created?
Thanks in advance!

"Ultra" Kit

Thanks for sharing the data. I have a question about the kit name listed for rel4, i.e. 'Ultra'. 1D Rapid and 1D ligation are quite common, but 'Ultra' does not seem to belong to this category. Do you have any idea what it refers to?
Thanks!

incomplete download links of fast5 files for 30x dataset

Hi all,

The links at https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md#fast5-signal-level-files seem to be incomplete. For example, the last listed part of chr1, part9, aligns to the middle of chromosome 1. After investigating the URLs (http://s3.amazonaws.com/nanopore-human-wgs/rel3-fast5-chr1.part??.tar), there seem to be 15 parts of chr1 but the table only shows 9 of them; a similar problem exists for other chromosomes.

Best,
Yifan
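
To compare the table against the parts reported above, the candidate URLs can be generated from the pattern quoted in this issue. This is a sketch under stated assumptions: the two-digit, zero-padded numbering starting at 01 is inferred from the `part??` glob and is not confirmed, and the part count per chromosome (15 for chr1, as reported here) must be supplied by the caller.

```python
BASE_URL = "http://s3.amazonaws.com/nanopore-human-wgs"


def part_urls(chrom, n_parts, release="rel3"):
    """Candidate tarball URLs for one chromosome.

    Assumes parts are numbered 01..n_parts with two-digit zero padding.
    """
    return [
        f"{BASE_URL}/{release}-fast5-{chrom}.part{i:02d}.tar"
        for i in range(1, n_parts + 1)
    ]


def unlisted_parts(chrom, n_parts, listed_urls):
    """Candidate URLs that do not appear among the links in the table."""
    listed = set(listed_urls)
    return [url for url in part_urls(chrom, n_parts) if url not in listed]
```

Calling `unlisted_parts("chr1", 15, listed)` with the links copied from the table returns the candidates that the table does not show; each one still needs to be checked for existence before downloading.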

Adapter sequences in reads

Hi,

Could you please confirm whether the reads (cDNA and dRNA FastQ files) still contain adapter sequences or whether these have been trimmed? I assume you used Canu for assembly and that trimming may have been done there. I would be happy to get more information from you about this.

Regards,
Saber.
