getting-data

Obtaining data sounds simple but, in my experience, rarely is. I created this repository to remind myself of the process, because I usually generate my own data and don't do this very often.

NCBI-SRA archive

NCBI stores raw DNA sequences in .sra format, which needs to be converted to FASTQ (with fastq-dump from the SRA Toolkit) for whole-genome processing pipelines. I followed the SRA Toolkit documentation but only managed to produce binary files. Having no idea why, I went looking for someone who had hit the same problem and found a GREAT article by the Edwards Lab at SDSU (https://edwards.sdsu.edu/research/fastq-dump/) that explains why the NCBI documentation is so bad and how to get the files you need. Thank you https://github.com/linsalrob/EdwardsLab!

Required parameters

  • Splitting reads

    --split-spot (splits each spot into its individual L and R reads and writes them all to one file)

    --split-files (splits each spot into L and R reads and writes FWD and REV reads to two separate files)

    --split-3 (splits each spot into L and R reads and writes pairs to two files; a read with no matching mate is written to a separate third file)

  • Readids

    -I | --readids (appends the read ID after the spot ID as 'accession.spot.readid' on the defline, so ID.1 = FWD read and ID.2 = REV read)

    NB. --readids breaks BWA in downstream applications, so this parameter may need to be omitted if BWA is in your pipeline. See Original format for more details.

  • Technical sequences

    --skip-technical (dumps only biological reads)

  • Clipping

    -W | --clip (applies L and R clips to remove SRA tags used in the whole genome amplification)

  • Read filtering

    --read-filter (keeps only reads with the given filter status; 'pass' filters out reads recorded as N). Options are: pass | reject | criteria | redacted

  • Original format

    -F | --origfmt (retains formatting of original definition line)

    If this parameter is omitted, the SRA archive rewrites the definition line of each sequence to include the SRA ID and length. This is a problem for BWA alignment, which does not recognize the new format and errors out, e.g. [mem_sam_pe] paired reads have different names: "SRR849970.3.1", "SRR849970.3.2"
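The BWA error above comes down to the two mates carrying different names, so it can be worth checking the deflines before starting an alignment. The following is a minimal sketch of such a check, not part of the SRA Toolkit: the function names `first_read_name` and `names_match` are made up for illustration, and it only looks at the first record of each file, assuming the read name is the first space-delimited token on the defline.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the SRA Toolkit or BWA).

# Print the read name of the first record in a FASTQ file:
# the first space-delimited token of line 1, without the leading '@'.
# Files ending in .gz are decompressed on the fly.
first_read_name() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | head -n 1 | cut -d ' ' -f 1 | sed 's/^@//'
}

# Succeed (exit status 0) only if both files start with the same read name.
names_match() {
    [ "$(first_read_name "$1")" = "$(first_read_name "$2")" ]
}
```

For example, `names_match reads_1.fastq.gz reads_2.fastq.gz || echo "read names differ"` (file names here are placeholders) prints a warning when the first pair is named like "SRR849970.3.1" and "SRR849970.3.2". It is only a spot check of the first record, not a full validation of the files.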

Optional parameters

  • Sequence data formatting

    -B | --dumpbase (ensures that output is A, T, C and G instead of color space (used for SOLiD))

  • Output to a specific directory

    -O | --outdir (writes the output files to the given directory; the default is the current working directory)

  • Compression

    --gzip (compresses the output with gzip)

Example

Replace SRRxxxxxx with the run accession and ~/PATH with your own directories.

cd ~/PATH/sratoolkit/bin

./fastq-dump --outdir ~/PATH --gzip --skip-technical --readids --dumpbase --split-files --clip --origfmt --read-filter pass SRRxxxxxx
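Since the note under Readids says --readids may need to be left out when BWA is downstream, it can help to keep both variants of the command in one place. Below is a minimal sketch of a dry-run helper that only prints the command. The function name `fastq_dump_cmd` and its third argument `bwa` are my own conventions, not SRA Toolkit options; the flags are the ones listed above.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints a fastq-dump command, does not run it.
#   $1 = run accession
#   $2 = output directory
#   $3 = optional; pass "bwa" to leave out --readids
fastq_dump_cmd() {
    readids="--readids"
    if [ "${3:-}" = "bwa" ]; then
        readids=""
    fi
    # $readids is deliberately unquoted so that an empty value disappears.
    echo fastq-dump --outdir "$2" --gzip --skip-technical $readids \
        --dumpbase --split-files --clip --origfmt --read-filter pass "$1"
}
```

Calling `fastq_dump_cmd SRRxxxxxx ~/PATH bwa` prints the command without --readids so it can be checked and then pasted into the terminal. Paths containing spaces would need quoting before the printed line is run.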
