fasta-parser

A lightweight Python module to read and write sequence records in FASTA format.

The design was inspired by BioPython’s SeqIO, which supports many sequence formats; this repo focuses only on FASTA records. It is faster than BioPython and can read compressed FASTA files (gz, bz2, zip, zstandard).

Requirements

Python >= 3.8

Optional dependencies

Two dependencies can be installed later if needed:

  1. lz4 >= 4.0.2

    Install with: pip install lz4

  2. zstandard >= 0.18.0

    Install with: pip install zstandard
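To see which of the optional packages are already present in an environment, something like the following can be used. This is a minimal sketch built on the standard library only; the helper name available_optional_packages is hypothetical and is not part of this module.

```python
import importlib.util

# Optional packages named in this README.
OPTIONAL_PACKAGES = ("lz4", "zstandard")


def available_optional_packages(names=OPTIONAL_PACKAGES):
    """Return {package name: True if it can be imported, else False}."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report the status of each optional package.
for name, found in available_optional_packages().items():
    print(f"{name}: {'installed' if found else 'not installed'}")
```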

Quick Start

Typical usage is to read a FASTA file and loop over the record(s).

import fasta

for record in fasta.parse('test/test.fasta'):
    print(record.id, len(record), record.seq[:10], record.desc)

Output:

NP_002433.1  362   METDAPQPGL   RNA-binding protein Musashi homolog 1 [Homo sapiens]
ENO94161.1    79   MKLLISGLGP   RRM domain-containing RNA-binding protein
sequence     292   MKLSKIALMM

Usage

This module contains the Record class representing a FASTA sequence record and the parse() function to read FASTA records from a file.

Record object

Record is an object that contains information on a FASTA sequence record, including id, description, and the sequence itself.

import fasta

record = fasta.Record(
    id='NP_950171.2', 
    seq='MEEEAETEEQQRFSYQQRLKAAVHYTVGCLCEEVALDKEMQFSKQTIAAISELTFRQCENFAKDLEMFASICRKRQE',
    desc='APITD1-CORT protein isoform 2 [Homo sapiens]'
)

print(record.id)            # NP_950171.2
print(record.desc)          # APITD1-CORT protein isoform 2 [Homo sapiens]
print(record.seq)           # MEEEAE..
print(record.description)   # >NP_950171.2 APITD1-CORT protein isoform 2 [Homo sapiens]
print(len(record))          # 77
print('EEEA' in record)     # True

By default, the sequence in FASTA output is wrapped at 70 characters per line. Pass the wrap argument to format() to set a different line length; use zero (or None) for no wrapping.

print(record)
# >NP_950171.2 APITD1-CORT protein isoform 2 [Homo sapiens]
# MEEEAETEEQQRFSYQQRLKAAVHYTVGCLCEEVALDKEMQFSKQTIAAISELTFRQCENFAKDLEMFAS
# ICRKRQE

print(record.format(wrap=30))
# >NP_950171.2 APITD1-CORT protein isoform 2 [Homo sapiens]
# MEEEAETEEQQRFSYQQRLKAAVHYTVGCL
# CEEVALDKEMQFSKQTIAAISELTFRQCEN
# FAKDLEMFASICRKRQE

print(record.format(wrap=None))
# >NP_950171.2 APITD1-CORT protein isoform 2 [Homo sapiens]
# MEEEAETEEQQRFSYQQRLKAAVHYTVGCLCEEVALDKEMQFSKQTIAAISELTFRQCENFAKDLEMFASICRKRQE
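The wrapping behavior shown above can be reproduced with plain string slicing. The following is only an illustrative sketch: wrap_sequence is a hypothetical helper, not the module's own implementation, and it assumes that a wrap of 0 or None means a single unwrapped line.

```python
def wrap_sequence(seq, wrap=70):
    """Split seq into lines of at most `wrap` characters.

    A wrap of 0 or None returns the sequence on a single line.
    """
    if not wrap:
        return seq
    # Take consecutive slices of length `wrap`; the last one may be shorter.
    return "\n".join(seq[i:i + wrap] for i in range(0, len(seq), wrap))


seq = "MEEEAETEEQQRFSYQQRLKAAVHYTVGCLCEEVALDKEMQFSKQTIAAISELTFRQCENFAKDLEMFASICRKRQE"
print(wrap_sequence(seq, wrap=30))
```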

parse

The parse() function is a generator that reads FASTA records from a file one at a time and yields each as a Record object. The file can be plain FASTA or compressed with gzip or bzip2. Because only one record is held in memory at a time, very little memory is required.

import fasta

for record in fasta.parse('test/test.fasta.gz'):
    print(record.id)
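To illustrate why a generator keeps memory use low, here is a minimal standalone sketch of a streaming FASTA reader. It is not the module's implementation: the names open_text, iter_fasta, and _split are hypothetical, records are yielded as plain tuples rather than Record objects, and compression is detected from the file's leading magic bytes, which is an assumption about one possible approach.

```python
import bz2
import gzip


def open_text(path):
    """Open a plain, gzip, or bzip2 file in text mode, chosen by magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(3)
    if magic[:2] == b"\x1f\x8b":   # gzip signature
        return gzip.open(path, "rt")
    if magic == b"BZh":            # bzip2 signature
        return bz2.open(path, "rt")
    return open(path, "rt")


def _split(header, chunks):
    """Turn a header line and sequence chunks into (id, description, sequence)."""
    rec_id, _, desc = header.partition(" ")
    return rec_id, desc, "".join(chunks)


def iter_fasta(path):
    """Yield (id, description, sequence) tuples one record at a time."""
    header, chunks = None, []
    with open_text(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield _split(header, chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    # Emit the final record after the file is exhausted.
    if header is not None:
        yield _split(header, chunks)
```

Only the lines of the current record are buffered, so memory use depends on the longest single record rather than on the size of the file.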

For some tasks you may need reusable access to the records. In that case, use Python's built-in list() function to turn the iterator into a list (this loads all records into memory):

import fasta

records = list(fasta.parse('test/test.fasta.gz'))
print(records[0].id)   # First record
print(records[-1].id)  # Last record

Another common task is to index your records by sequence identifier. Use to_dict() to turn a Record iterator (or list) into a dictionary keyed by record id.

import fasta

records = fasta.to_dict(fasta.parse('test/test.fasta.gz'))
print(records['NP_002433.1'])   # Use any record id
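Conceptually, indexing by identifier amounts to building a dictionary from an iterable of records. The sketch below shows the idea with a stand-in record type so that it runs without this module. The names Rec and index_by_id are hypothetical, and raising an error on duplicate ids is an assumption made for illustration; how to_dict() itself treats duplicates is not stated here.

```python
from collections import namedtuple

# Stand-in for fasta.Record so the sketch runs without the module installed.
Rec = namedtuple("Rec", ["id", "seq"])


def index_by_id(records):
    """Build {id: record} from any iterable of records, rejecting duplicate ids."""
    index = {}
    for record in records:
        if record.id in index:
            raise ValueError(f"duplicate record id: {record.id}")
        index[record.id] = record
    return index


records = index_by_id([Rec("seq1", "MKLL"), Rec("seq2", "METD")])
print(records["seq2"].seq)
```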

Test

You can run tests to ensure that the module works as expected.

./test.py

License

GNU General Public License, version 3

Contributors

aziele, phenolophthaleinum, avirfrog
