ve-sequence-parsers's Introduction

THIS REPO HAS MOVED!

It now lives in our OSS monorepo - https://github.com/TeselaGen/tg-oss

It can now be installed as npm install @teselagen/bio-parsers

Bio Parsers

About this Repo

This repo contains a set of parsers to convert between datatypes through a generalized JSON format.

Exported Functions

Use the following exports to convert to a generalized JSON format:

fastaToJson //handles fasta files (.fa, .fasta)
genbankToJson //handles genbank files (.gb, .gbk)
ab1ToJson //handles .ab1 sequencing read files 
sbolXmlToJson //handles .sbol files
geneiousXmlToJson //handles .geneious files
jbeiXmlToJson //handles jbei .seq or .xml files
snapgeneToJson //handles snapgene (.dna) files
anyToJson    //this handles any of the above file types based on file extension

Use the following exports to convert from a generalized JSON format back to a specific format:

jsonToGenbank
jsonToFasta
jsonToBed

Format Specification

The generalized JSON format looks like:

const generalizedJsonFormat = {
    "size": 25,
    "sequence": "asaasdgasdgasdgasdgasgdasgdasdgasdgasgdagasdgasdfasdfdfasdfa",
    "circular": true,
    "name": "pBbS8c-RFP",
    "description": "",
    "parts": [
      {
        "name": "part 1",
        "type": "CDS", //optional for parts
        "id": "092j92", //Must be a unique id. If no id is provided, we'll autogenerate one for you
        "start": 10, //0-based inclusive index
        "end": 30, //0-based inclusive index
        "strand": 1,
        "notes": {},
      }
    ],
    "primers": [
      {
        "name": "primer 1",
        "id": "092j92", //Must be a unique id. If no id is provided, we'll autogenerate one for you
        "start": 10, //0-based inclusive index
        "end": 30, //0-based inclusive index
        "strand": 1,
        "notes": {},
      }
    ],
    "features": [
        {
            "name": "anonymous feature",
            "type": "misc_feature",
            "id": "5590c1978979df000a4f02c7", //Must be a unique id. If no id is provided, we'll autogenerate one for you
            "start": 1,
            "end": 3,
            "strand": 1,
            "notes": {},
        },
        {
            "name": "coding region 1",
            "type": "CDS",
            "id": "5590c1d88979df000a4f02f5",
            "start": 12,
            "end": 9,
            "strand": -1,
            "notes": {},
        }
    ],
    //only if parsing in an ab1 file
    "chromatogramData": { 
      "aTrace": [], //same as cTrace but for a
      "tTrace": [], //same as cTrace but for t
      "gTrace": [], //same as cTrace but for g
      "cTrace": [0,0,0,1,3,5,11,24,56,68,54,30,21,3,1,4,1,0,0, ...etc], //heights of the curve spaced 1 per x position (aka if the cTrace.length === 1000, then the max basePos can be is 1000)
      "basePos": [33, 46, 55, ...etc], //x position of the bases (can be unevenly spaced)
      "baseCalls": ["A", "T", ...etc],
      "qualNums": [], //or undefined if no qualNums are detected on the file
    },
}
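
The start and end values above are 0-based and inclusive, and the second feature has a start (12) greater than its end (9). Here is a minimal sketch of how such coordinates could be turned into a length. It assumes, which the spec above does not state, that start > end means the annotation wraps around the origin of a circular sequence. annotationLength is a hypothetical helper, not a bio-parsers export.

```javascript
// Hypothetical helper, not part of bio-parsers.
// Assumes 0-based inclusive coordinates and that start > end means the
// annotation wraps around the origin of a circular sequence.
function annotationLength({ start, end }, sequenceLength) {
  if (start <= end) {
    return end - start + 1; // inclusive on both sides
  }
  // wraps the origin: start..lastIndex plus 0..end
  return sequenceLength - start + (end + 1);
}

console.log(annotationLength({ start: 10, end: 30 }, 60)); // 21
console.log(annotationLength({ start: 12, end: 9 }, 25)); // 23
```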

Usage

install

npm install -S bio-parsers

or

yarn add bio-parsers

or

use it from a script tag:

<script src="https://unpkg.com/bio-parsers/umd/bio-parsers.js"></script>
<script>
      async function main() {
        var jsonOutput = await window.bioParsers.genbankToJson(
          `LOCUS       kc2         108 bp    DNA     linear    01-NOV-2016
COMMENT             teselagen_unique_id: 581929a7bc6d3e00ac7394e8
FEATURES             Location/Qualifiers
     CDS             1..108
                     /label="GFPuv"
     misc_feature    61..108
                     /label="gly_ser_linker"
     bogus_dude      4..60
                     /label="ccmN_sig_pep"
     misc_feature    4..60
                     /label="ccmN_nterm_sig_pep"
                     /pragma="Teselagen_Part"
                     /preferred5PrimeOverhangs=""
                     /preferred3PrimeOverhangs=""
ORIGIN      
        1 atgaaggtct acggcaagga acagtttttg cggatgcgcc agagcatgtt ccccgatcgc
       61 ggtggcagtg gtagcgggag ctcgggtggc tcaggctctg ggg
//`
        );
        console.log('jsonOutput:', jsonOutput);
        var genbankString = window.bioParsers.jsonToGenbank(jsonOutput[0].parsedSequence);
        console.log(genbankString);
      }
      main();
</script>

see the ./umd_demo.html file for a full working example

jsonToGenbank (same interface as jsonToFasta)

//To go from json to genbank:
import { jsonToGenbank } from "bio-parsers"
//You can pass an optional options object as the second argument. Here are the defaults
const options = {
  isProtein: false, //by default the sequence will be parsed and validated as type DNA (unless U's instead of T's are found). If isProtein=true the sequence will be parsed and validated as a PROTEIN type (seqData.isProtein === true)
  guessIfProtein: false, //if true the parser will attempt to guess if the sequence is of type DNA or type PROTEIN (this will override the isProtein flag)
  guessIfProteinOptions: {
    threshold: 0.90, //fraction of characters that must be DNA letters for the sequence to be considered of type DNA
    dnaLetters: ['G', 'A', 'T', 'C'] //customizable set of letters to use as DNA 
  }, 
  inclusive1BasedStart: false, //by default feature starts are parsed out as 0-based and inclusive 
  inclusive1BasedEnd: false //by default feature ends are parsed out as 0-based and inclusive 
  // Example:
  // 0123456
  // ATGAGAG
  // --fff--  (the feature covers GAG)
  // 0-based inclusive start:
  // feature.start = 2
  // 1-based inclusive start:
  // feature.start = 3
  // 0-based inclusive end:
  // feature.end = 4
  // 1-based inclusive end:
  // feature.end = 5
} 
const genbankString = jsonToGenbank(generalizedJsonFormat, options)
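
The index example in the comments above can be checked with plain JavaScript. This is only a sketch of the arithmetic. toOneBasedInclusive is a hypothetical helper, not something the library exports.

```javascript
// Hypothetical helper: convert a 0-based inclusive range to 1-based inclusive.
function toOneBasedInclusive({ start, end }) {
  return { start: start + 1, end: end + 1 };
}

const sequence = "ATGAGAG";
const feature = { start: 2, end: 4 }; // 0-based inclusive, covers GAG

// String.prototype.slice takes an exclusive end, hence end + 1.
console.log(sequence.slice(feature.start, feature.end + 1)); // "GAG"
console.log(toOneBasedInclusive(feature)); // { start: 3, end: 5 }
```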

anyToJson (same interface as genbankToJson, fastaToJson, xxxxToJson) (async required)

import { anyToJson } from "bio-parsers"

//note, anyToJson should be called using an await to allow for file parsing to occur (if a file is being passed)
const results = await anyToJson(
  stringOrFile, //if ab1 files are being passed in you should pass files only, otherwise strings or files are fine as inputs
  options //options.fileName (eg "pBad.ab1" or "pCherry.fasta") is important to pass here so the parser can pick the file type from the file extension
) 

//we always return an array of results because some files may contain multiple sequences 
results[0].success //either true or false 
results[0].messages //an array of strings giving any warnings or errors generated during the parsing process
results[0].parsedSequence //this will be the generalized json format as specified above :)
//chromatogram data will be here (ab1 only): 
results[0].parsedSequence.chromatogramData 
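
Because the result is always an array, callers usually want to separate the sequences that parsed from the messages that came back. A minimal sketch of that bookkeeping follows. summarizeResults and the mock data are hypothetical, and the code runs without bio-parsers installed.

```javascript
// Hypothetical helper that splits parser results by their success flag.
// `results` is assumed to have the shape described above.
function summarizeResults(results) {
  const sequences = [];
  const problems = [];
  for (const { success, messages, parsedSequence } of results) {
    if (success) {
      sequences.push(parsedSequence);
    }
    // messages may hold warnings even when parsing succeeded;
    // concat also copes with a single string instead of an array
    for (const message of [].concat(messages || [])) {
      problems.push(message);
    }
  }
  return { sequences, problems };
}

// Mock data for illustration only (not real parser output).
const mockResults = [
  { success: true, messages: ["a warning"], parsedSequence: { name: "seq1", sequence: "atgc" } },
  { success: false, messages: "XML is not valid Jbei or Sbol format" },
];
const summary = summarizeResults(mockResults);
console.log(summary.sequences.length); // 1
console.log(summary.problems.length); // 2
```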

Options (for anyToJson or xxxxToJson)

//You can pass an optional options object as the second argument. Here are the defaults
const options = {
  fileName: "example.gb", //the filename is used as the sequence name if none is found in the genbank
  isProtein: false, //if you know that it is a protein string being parsed you can pass true here
  parseFastaAsCircular: false, //by default fasta files are parsed as linear sequences. You can change this by setting parseFastaAsCircular=true 
  //genbankToJson options only
  inclusive1BasedStart: false, //by default feature starts are parsed out as 0-based and inclusive 
  inclusive1BasedEnd: false, //by default feature ends are parsed out as 0-based and inclusive 
  acceptParts: true, //by default features with a feature.notes.pragma[0] === "Teselagen_Part" are added to the sequenceData.parts array. Setting this to false will keep them as features instead
  // fastaToJson options only
  parseName: true //by default attempt to parse the name and description of sequence from the comment line. Setting this to false will keep the name unchanged with no description
}
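
To make the acceptParts rule concrete, here is a rough re-implementation of the idea in plain JavaScript. It is a sketch for illustration, not the library's code. splitPartsFromFeatures is a hypothetical name.

```javascript
// Hypothetical re-implementation of the acceptParts idea, for illustration.
function splitPartsFromFeatures(features) {
  const parts = [];
  const remaining = [];
  for (const feature of features) {
    const pragma = feature.notes && feature.notes.pragma;
    if (Array.isArray(pragma) && pragma[0] === "Teselagen_Part") {
      parts.push(feature);
    } else {
      remaining.push(feature);
    }
  }
  return { parts, features: remaining };
}

const split = splitPartsFromFeatures([
  { name: "GFPuv", notes: {} },
  { name: "ccmN_nterm_sig_pep", notes: { pragma: ["Teselagen_Part"] } },
]);
console.log(split.parts.map((p) => p.name)); // [ 'ccmN_nterm_sig_pep' ]
console.log(split.features.map((f) => f.name)); // [ 'GFPuv' ]
```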

ab1ToJson

import { ab1ToJson } from "bio-parsers"
const results = await ab1ToJson(
  //this can be either a browser file  <input type="file" id="input" multiple onchange="ab1ToJson(this.files[0])">
  // or a node file ab1ToJson(fs.readFileSync(path.join(__dirname, './testData/ab1/example1.ab1')));
  file, 
  options //options.fileName (eg "pBad.ab1" or "pCherry.fasta") is important to pass here
)

//we always return an array of results because some files may contain multiple sequences 
results[0].success //either true or false 
results[0].messages //an array of strings giving any warnings or errors generated during the parsing process
results[0].parsedSequence //this will be the generalized json format as specified above :)
//chromatogram data will be here (ab1 only): 
results[0].parsedSequence.chromatogramData 
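
As a sketch of how the chromatogram arrays relate to each other, the snippet below looks up the trace height under each base call. It assumes basePos[i] is an index into the trace arrays, as the comments in the format specification suggest. peakHeights and the sample data are made up for illustration.

```javascript
// Hypothetical helper: look up the trace height under each base call.
// Assumes basePos[i] is an index into the trace arrays.
function peakHeights({ aTrace, tTrace, gTrace, cTrace, basePos, baseCalls }) {
  const traces = { A: aTrace, T: tTrace, G: gTrace, C: cTrace };
  return baseCalls.map((base, i) => {
    const trace = traces[base];
    // ambiguous calls such as "N" have no single trace
    return trace ? trace[basePos[i]] : undefined;
  });
}

// Tiny made-up chromatogram, for illustration only.
const chromatogramData = {
  aTrace: [0, 5, 9, 4, 0, 0, 0, 0],
  tTrace: [0, 0, 0, 0, 3, 8, 2, 0],
  gTrace: [0, 0, 0, 0, 0, 0, 0, 0],
  cTrace: [0, 0, 0, 0, 0, 0, 0, 0],
  basePos: [2, 5],
  baseCalls: ["A", "T"],
};
console.log(peakHeights(chromatogramData)); // [ 9, 8 ]
```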

snapgeneToJson (.dna files)

import { snapgeneToJson } from "bio-parsers"
//file can be either a browser file  <input type="file" id="input" multiple onchange="snapgeneToJson(this.files[0])">
// or a node file snapgeneToJson(fs.readFileSync(path.join(__dirname, './path/to/example.dna')));
const results = await snapgeneToJson(file,options)

genbankToJson

import { genbankToJson } from "bio-parsers"

const result = genbankToJson(string, options)

console.info(result)
// [
//     {
//         "messages": [
//             "Import Error: Illegal character(s) detected and removed from sequence. Allowed characters are: atgcyrswkmbvdhn",
//             "Invalid feature end:  1384 detected for Homo sapiens and set to 1",
//         ],
//         "success": true,
//         "parsedSequence": {
//             "features": [
//                 {
//                     "notes": {
//                         "organism": [
//                             "Homo sapiens"
//                         ],
//                         "db_xref": [
//                             "taxon:9606"
//                         ],
//                         "chromosome": [
//                             "17"
//                         ],
//                         "map": [
//                             "17q21"
//                         ]
//                     },
//                     "type": "source",
//                     "strand": 1,
//                     "name": "Homo sapiens",
//                     "start": 0,
//                     "end": 1
//                 }
//             ],
//             "name": "NP_003623",
//             "sequence": "gagaggggggttatccccccttcgtcagtcgatcgtaacgtatcagcagcgcgcgagattttctggcgcagtcag",
//             "circular": true,
//             "extraLines": [
//                 "DEFINITION  contactin-associated protein 1 precursor [Homo sapiens].",
//                 "ACCESSION   NP_003623",
//                 "VERSION     NP_003623.1  GI:4505463",
//                 "DBSOURCE    REFSEQ: accession NM_003632.2",
//                 "KEYWORDS    RefSeq."
//             ],
//             "type": "DNA",
//             "size": 925
//         }
//     }
// ]

You can see more examples by looking at the tests.

Editing This Repo

All collaborators:

Edit/create a new file and update/add any relevant tests. Make sure they pass by running yarn test

Debug

yarn test-debug

Updating this repo

Teselagen collaborators

Commit and push all changes

Sign into npm using the teselagen npm account (npm whoami)

npm version patch|minor|major
npm publish

Outside collaborators

fork and pull request please :)

Thanks/Collaborators

ve-sequence-parsers's People

Contributors

andresprez, benjamin-lee, ccwilson, dependabot-preview[bot], djriffle, griggol, kcrafty, kimberley23, linediconsine, rpavez, samdenicola, tgreen7, thomas1664, tiffanydai, tnrich, xinggao-pki

ve-sequence-parsers's Issues

'Buffer is not defined' error in snapgeneToJson.js

Hi Thomas (@tnrich),

I ran into a problem when using the snapgeneToJson.js parser.
In particular the error is Uncaught (in promise) ReferenceError: Buffer is not defined.
This seems to be associated with an update (it worked before 😉) of the dependencies that are used in the project, but I could not identify the specific package so far. I could, however, locate the code that triggers the error.

In an attempt to make it work again I came across 'The buffer module from node.js, for the browser'. If I add it to the dependencies and modify snapgeneToJson.js slightly it all works again. I simply added the two lines below as suggested by the buffer repository:

diff --git a/snapgeneToJson.js b/snapgeneToJson.js
--- a/snapgeneToJson.js
+++ b/snapgeneToJson.js
@@ -11,6 +11,8 @@ import createInitialSequence from './utils/createInitialSequence';
 import validateSequenceArray from './utils/validateSequenceArray';
 import flattenSequenceArray from './utils/flattenSequenceArray';

+var Buffer = require('buffer/').Buffer
+
 async function snapgeneToJson(
   fileObj,
   onFileParsedUnwrapped,

Maybe it should be extended similarly as done somewhere else.
Or maybe there is another solution that you can suggest. If not, would you be willing to merge a PR with these changes?

Cheers,
Marcel
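
For simple binary reads there is also a way to avoid the Buffer global entirely. The sketch below swaps in the standard DataView API, which exists in both browsers and Node. It is not what snapgeneToJson.js does, and readUInt32BE is a hypothetical helper.

```javascript
// Hypothetical helper mirroring Buffer#readUInt32BE on a plain ArrayBuffer.
function readUInt32BE(arrayBuffer, offset) {
  // second argument false = big-endian
  return new DataView(arrayBuffer).getUint32(offset, false);
}

const bytes = new Uint8Array([0x00, 0x00, 0x01, 0x02]);
console.log(readUInt32BE(bytes.buffer, 0)); // 258
```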

problems using sbolXmlToJson on synbiohub files

@tnrich

Hi all, I am learning how to use sbolXmlToJson.
I downloaded multiple SBOL XML files from synbiohub.org and from the SBOL specification docs.

I created an online repl playground where you can see my test.

These are the options I used (I also tried with no options):

const options = {
    fileName: "example.gb",
    isProtein: false,
    parseFastaAsCircular: false,
    inclusive1BasedStart: false ,
    inclusive1BasedEnd: false,
    acceptParts: true
}

and this is the output I have:

./references/B11_simple.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/B12.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/B13.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/B14.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/B14.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/B15.xml 

[ { success: false, messages: 'Error parsing XML to JSON' } ]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/BBa_F2620.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/BBa_J23100.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/iGem_Collection/BBa_B0057.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/iGem_Collection/BBa_B1011.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/iGem_Collection/BBa_B3101.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/iGem_Collection/BBa_B1011.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/Bacillus_subtilis_Collection/module_BO_32930_encodes_BO_26630.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/Bacillus_subtilis_Collection/module_BO_32932_encodes_BO_26635.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/Bacillus_subtilis_Collection/module_BO_32965_encodes_BO_26967.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/Bacillus_subtilis_Collection/module_BO_32977_encodes_BO_26786.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

 ./references/Bacillus_subtilis_Collection/module_BO_32979_encodes_BO_26787.xml 

[
  { success: false, messages: 'XML is not valid Jbei or Sbol format' }
]
>Untitled Sequence||0|linear

LOCUS       Untitled_Sequence           0 bp    DNA     linear   SYN 08-MAR-2022
ORIGIN      
//

Can you please help me here?

is possible a JS?

@tnrich
Dear all,
I am not an advanced web developer, and my efforts to import the file as you describe are not working in my case. I think most biologists do not have the skills to do it either. Could you consider a plain JS version that we could import in our files?

thanks in advance

Genbank file DEFINITION field should populate description

The DEFINITION line in a genbank file should be read as the description in the JSON.
Perhaps there are other fields where the description might come from but this is the standard place it seems:
https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#DefinitionB

E.g. your test file:
https://github.com/TeselaGen/ve-sequence-parsers/blob/master/src/test/testData/genbank/testGenbankFile.gb
has:
DEFINITION promoter seq from pBAD33.

So I think this definitely needs fixing - WDYT @tnrich ?
I could probably fix it and raise a PR, just not sure when I will get to it ... we're coming back to this and picking up a new bio-parsers version in a few weeks, but maybe you'll be able to take a look before that.

Thanks!
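
A minimal sketch of what reading the DEFINITION field could look like, assuming continuation lines are indented and the next keyword starts in column 0. extractDefinition is a hypothetical helper, not the library's parser, and the LOCUS line in the sample is made up.

```javascript
// Hypothetical helper, not the library's parser: pull the DEFINITION text
// out of a genbank string. Continuation lines are assumed to be indented.
function extractDefinition(genbankText) {
  const lines = genbankText.split(/\r?\n/);
  const startIndex = lines.findIndex((line) => line.startsWith("DEFINITION"));
  if (startIndex === -1) return "";
  const collected = [lines[startIndex].slice("DEFINITION".length).trim()];
  for (let i = startIndex + 1; i < lines.length; i++) {
    // a new keyword starts in column 0, so stop at the first unindented line
    if (!/^\s+\S/.test(lines[i])) break;
    collected.push(lines[i].trim());
  }
  return collected.join(" ");
}

// Sample text for illustration; only the DEFINITION line comes from the issue.
const genbank = [
  "LOCUS       example_promoter   20 bp    DNA     linear   01-JAN-2000",
  "DEFINITION  promoter seq from pBAD33.",
  "ACCESSION   unknown",
].join("\n");
console.log(extractDefinition(genbank)); // "promoter seq from pBAD33."
```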

Browserfied version

Hi Team Teselagen,

I've stumbled across your repos while looking for genbank parsers and I am very impressed. Great work.

I would be interested in using this library in a pure browser (not node) application. Is there any chance that you have these libraries in browserified format (preprocessed; minimized)? If not, I believe this would be a useful format if you have the time/bandwidth.

Thanks,
zach cp

Ambiguous AA handling X vs - chars

Hi @tnrich ,

I noticed that ambiguous amino acids ('X') are not rendered.

I used this fix, but it may not work where proteinAlphabet["-"] is used.

  if (threeLetterSequenceStringToAminoAcidMap[sequenceString]) {
    return threeLetterSequenceStringToAminoAcidMap[sequenceString];
  } else {
    return proteinAlphabet["X"];
  }
  // else {
  //   return proteinAlphabet["-"]; //return a gap/undefined character
  // }

Originally posted by @kelmazouari in #140 (comment)

[Q] How to check whether the fasta sequence is protein or nucleic?

Hi. Sorry for posting the question here, I didn't find a better place to ask.
validateSequence accepts an isProtein option to validate sequence characters properly:

if (isProtein) {
    ...
 } else {
    ...
 }

I'm wondering whether there is a reliable way to check if a fasta sequence is protein or nucleic just by looking at it, so that I can pass a valid isProtein option to this method.

For example, I've got the following fasta sequence and I need to know whether it is a protein or not:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Any help will be appreciated!
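
One common heuristic is the threshold idea that the guessIfProtein option in the README describes: count how many letters belong to a DNA alphabet and compare the fraction with a cutoff. The sketch below is an illustration of that idea under those assumptions, not the library's implementation. A short peptide made only of G, A, T and C residues would be misclassified.

```javascript
// Heuristic sketch, not the library's implementation.
// Treat the sequence as DNA when at least `threshold` of its letters are
// in `dnaLetters`; otherwise guess protein.
function guessIfProtein(sequence, { threshold = 0.9, dnaLetters = ["G", "A", "T", "C"] } = {}) {
  const letters = sequence.toUpperCase().replace(/[^A-Z]/g, "");
  if (letters.length === 0) return false;
  const dnaSet = new Set(dnaLetters);
  let dnaCount = 0;
  for (const letter of letters) {
    if (dnaSet.has(letter)) dnaCount++;
  }
  return dnaCount / letters.length < threshold;
}

console.log(guessIfProtein("ATGAAGGTCTACGGCAAGGAACAG")); // false
console.log(
  guessIfProtein("LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV")
); // true
```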

Suggestion: don't try to parse FASTA header lines (or make an option to turn it off)

@tnrich

Right now, when parsing a standard NCBI header line such as gb|M73307|AGMA13GT , the sequence "name" is set as the first thing before the pipe: gb. This is due to this function:

function parseTitle(line) {
  const pipeIndex = line.indexOf("|");
  if (pipeIndex > -1) {
    result.parsedSequence.name = line.slice(1, pipeIndex);
    result.parsedSequence.description = line.slice(pipeIndex + 1);
  } else {
    result.parsedSequence.name = line.slice(1);
  }
}

Given the fact that FASTA headers are completely inconsistent and a mess, I would suggest skipping header parsing altogether or at least making it able to be disabled. If feeling fancy, one could pass a parser function that would return {name, description} so the user could decide.
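
The "pass a parser function" idea could look roughly like this. It is a sketch only. parseFastaHeader and the parseTitle option name are hypothetical, and defaultParseTitle mirrors the pipe-splitting behavior quoted above.

```javascript
// Hypothetical sketch of a pluggable FASTA header parser.
// defaultParseTitle mirrors the pipe-splitting behavior quoted above.
function defaultParseTitle(line) {
  const pipeIndex = line.indexOf("|");
  if (pipeIndex > -1) {
    return {
      name: line.slice(1, pipeIndex),
      description: line.slice(pipeIndex + 1),
    };
  }
  return { name: line.slice(1), description: "" };
}

// parseTitle is a made-up option name; pass a function to override the default.
function parseFastaHeader(line, { parseTitle = defaultParseTitle } = {}) {
  return parseTitle(line);
}

const header = ">gb|M73307|AGMA13GT";
console.log(parseFastaHeader(header).name); // "gb"

// Keep the whole header (minus ">") as the name instead.
const keepWhole = (line) => ({ name: line.slice(1), description: "" });
console.log(parseFastaHeader(header, { parseTitle: keepWhole }).name); // "gb|M73307|AGMA13GT"
```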

sync code wrapped in promise

I'm using genbankToJson and the function does not seem to have any async code (all callbacks passed are executed synchronously)

Therefore I don't understand why the method doesn't just return the parsed genbank synchronously, instead of wrapping it in a Promise or letting the user provide a callback.

Sequence type parsed from a genbank file to be more concrete

Hi there and thank you for this library.

When I parse a genbank file (e.g. this one), the returned sequence type is DNA, but I need it to be more specific, as it is in the LOCUS field of the file (ds-DNA in the example file).
From a quick look at the source, it seems the sequence type is not parsed from that field but is determined by separate logic.

It would be great if we could support parsing it from the LOCUS field. WDYT @tnrich ?

Thank you.
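
As a rough sketch of the request, the molecule type could be read as the token that follows "bp" on the LOCUS line. moleculeTypeFromLocus is a hypothetical helper. It does no validation, so a LOCUS line without a molecule type would return whatever token comes next.

```javascript
// Hypothetical helper: read the molecule type token that follows "bp" on a
// LOCUS line. Returns null when no such token is found.
function moleculeTypeFromLocus(locusLine) {
  const match = /\sbp\s+(\S+)/.exec(locusLine);
  return match ? match[1] : null;
}

console.log(
  moleculeTypeFromLocus("LOCUS       kc2         108 bp    DNA     linear    01-NOV-2016")
); // "DNA"
// Made-up LOCUS line for illustration.
console.log(
  moleculeTypeFromLocus("LOCUS       example    5028 bp ds-DNA     circular SYN 08-MAR-2022")
); // "ds-DNA"
```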

Understanding package releases

@tnrich

Hello!

First off, thank you for your excellent work on this package.

Do you make any release notes available for published package versions? The changelog isn't current and I can't find other resources for understanding the implications of updating to the latest version.

I don't have a background in bio so any help understanding the commit history would be greatly appreciated.

genbankToJson parser crops any feature string containing a "=".

Hi there and thank you for this library. I am parsing Genbank files with it and I noticed that every label containing a "=" gets cut off after the "=". I believe this is because of the following lines, which simply take the second element (but not any subsequent element) after the line split:

arr = line.split(/=/);
return arr[1];

I believe replacing return arr[1] with return arr.slice(1).join('=') would solve the problem.
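
A minimal sketch of the fix, assuming the qualifier line has the form /key=value. The parts are rejoined with "=" because joining with an empty string would silently drop the later "=" characters. qualifierValue is a hypothetical helper name.

```javascript
// Hypothetical helper: return everything after the first "=".
function qualifierValue(line) {
  const parts = line.split(/=/);
  // rejoin with "=" so later "=" characters are kept
  return parts.slice(1).join("=");
}

// Prints the quoted value with its inner "=" intact: "lacI=repressor"
console.log(qualifierValue('/label="lacI=repressor"'));
```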

package is very big

Hello,

First, thanks for putting out this useful library under a permissive license.

I'm packaging this library for usage in a web application. I find that because of its dependencies (lodash, xml2js...) the bundle ends up taking almost 1MB after minification just because of bio-parsers.

I think this could be improved quite easily.

  1. Lodash seems superfluous. Only a few methods like each are used, which can be easily replaced by vanilla javascript.

  2. The package could be split into several packages. This way, if I need the genbank parser, I don't have the xml2js package bundled with it, since it's not used for that part.

I forked the project and I could work on a PR if you are interested in those improvements.
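
For the first point, a vanilla stand-in for iterating a plain object could look like the sketch below. eachEntry is a hypothetical helper that covers only the plain-object case, which is roughly what _.each(obj, fn) does there. It does not reproduce lodash's early exit on a false return value.

```javascript
// Vanilla replacement for iterating an object's own enumerable keys,
// roughly what _.each(obj, fn) does for plain objects.
function eachEntry(object, callback) {
  for (const [key, value] of Object.entries(object)) {
    callback(value, key);
  }
}

const seen = [];
eachEntry({ organism: ["Homo sapiens"], map: ["17q21"] }, (value, key) => {
  seen.push(`${key}=${value[0]}`);
});
console.log(seen); // [ 'organism=Homo sapiens', 'map=17q21' ]
```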

jsontogenbank error

@tnrich
First I saved the jsonToGenbank module as json2gbk.js
https://github.com/TeselaGen/ve-sequence-parsers#jsontogenbank-same-interface-as-jsontofasta-no-async-required

Then I launched the program:

node json2gbk.js

I got the following errors.

/panfs/biopan01/scratch-147760/147760/BH_data/json2gbk.js:4
const jsonToGenbank = require('bio-parsers/parsers/jsonToGenbank');
^

SyntaxError: Identifier 'jsonToGenbank' has already been declared
at createScript (vm.js:56:10)
at Object.runInThisContext (vm.js:97:10)
at Module._compile (module.js:549:28)
at Object.Module._extensions..js (module.js:586:10)
at Module.load (module.js:494:32)
at tryModuleLoad (module.js:453:12)
at Function.Module._load (module.js:445:3)
at Module.runMain (module.js:611:10)
at run (bootstrap_node.js:394:7)
at startup (bootstrap_node.js:160:9)

lint-staged should be in devDependencies?

Hi -
I'm running some tools to list our third-party code and a whole load of npm dependencies seem to be coming through from you guys, way more than before. They don't seem to get bundled via our use of the lib via npm, but they are listed by the tooling.
I think it is because "lint-staged" was added as a standard dependency but should be in devDependencies?
Could you take a look please @tnrich ? Let me know if you need more info/help.
Otherwise we've been happily using this - so thanks!

V_region is not parsed

V_region is not parsed because there is a space at the end of it in parsers/utils/GenbankFeatureTypes.js

Currently V_region is being parsed as Misc. Feature
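
A small sketch of why the trailing space matters and how trimming the list once would guard against it. The list below is illustrative only, not the contents of GenbankFeatureTypes.js, and the fallback to misc_feature is an assumption based on the behavior described above.

```javascript
// Illustrative list only; the real file is parsers/utils/GenbankFeatureTypes.js.
const rawFeatureTypes = ["CDS", "misc_feature", "V_region "]; // note the trailing space

// Trimming once up front makes the lookup immune to stray whitespace.
const featureTypes = new Set(rawFeatureTypes.map((type) => type.trim()));

// Assumed fallback: unknown types become misc_feature.
function normalizeFeatureType(type) {
  return featureTypes.has(type) ? type : "misc_feature";
}

console.log(rawFeatureTypes.includes("V_region")); // false
console.log(normalizeFeatureType("V_region")); // "V_region"
```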
