
halo_nr_tx's Introduction

Generating a non-redundant transcriptome for Halobacterium salinarum

Some genes of Halobacterium salinarum NRC-1 are present in multiple copies. Reducing them to a non-redundant set of genes can simplify analyses. Moreover, several independent annotation efforts have produced non-standardized locus tags, which makes it hard to integrate findings from different studies.

This repository is an attempt to create a non-redundant (pseudo)transcriptome for Halobacterium salinarum NRC-1, using the annotation released by Pfeiffer et al. 2019 as the source. A dictionary of alternative locus_tags was also created using the annotation produced by NCBI's RefSeq prokaryotic automatic annotation pipeline and a custom in-house file derived from the original annotation by Ng et al. 2000.

Briefly:

1.1) The Pfeiffer et al. 2019 third-party annotation was manually downloaded from:

1.2) Files were renamed and concatenated;
1.3) Accession names were replaced in the GFF3 file to match those of NCBI RefSeq;

2.1) NCBI RefSeq annotations and genomes (ASM680v1 and ASM6902v1) were automatically downloaded from the FTP resource and minimally parsed;

3.1) The annotation derived from Ng et al. 2000, which contains protein sequences, was loaded and parsed;

4.1) For CDS, nucleotide sequences were obtained using the genome and both annotations, and translated to amino acid sequences using translation table 11. The amino acid sequences were then clustered using CD-HIT (95% identity) to obtain the non-redundant set;
4.2) For RNAs, nucleotide sequences were obtained using the genome and both annotations. The nucleotide sequences were then clustered using CD-HIT (99% identity) to obtain the non-redundant set;
4.3) A cluster containing a single locus_tag that comes from NCBI is automatically discarded;

5.1) Finally, an Rmd file containing the dictionary table is knitted.
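
As an illustration of step 4.1, the following is a minimal Python sketch of extracting a CDS from a genome sequence with GFF3-style coordinates and translating it with NCBI translation table 11. It is not the code used in this repository; the function names (`extract_cds`, `translate_cds`) and the toy sequences are hypothetical, and the sketch assumes that an alternative start codon should be read as methionine and that the trailing stop codon should be dropped.

```python
from itertools import product

# NCBI translation table 11 (bacterial, archaeal and plant plastid code).
# Codons are ordered with T, C, A, G at each position, first base slowest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

# Initiation codons listed for table 11.
START_CODONS = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_cds(genome, start, end, strand):
    """Slice a feature using GFF3 coordinates (1-based, inclusive)."""
    sub = genome[start - 1:end].upper()
    return reverse_complement(sub) if strand == "-" else sub


def translate_cds(nt):
    """Translate a complete CDS; unknown codons become 'X'."""
    nt = nt.upper()
    if len(nt) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = [nt[i:i + 3] for i in range(0, len(nt), 3)]
    aa = [CODON_TABLE.get(codon, "X") for codon in codons]
    # Assumption: an alternative start codon is read as methionine.
    if codons and codons[0] in START_CODONS:
        aa[0] = "M"
    protein = "".join(aa)
    # Assumption: the terminal stop codon is not kept in the protein.
    return protein[:-1] if protein.endswith("*") else protein


# Toy example (made-up sequence): a GTG-initiated CDS on the plus strand.
toy_genome = "AAGTGAAATAACC"
toy_protein = translate_cds(extract_cds(toy_genome, 3, 11, "+"))
```

The amino acid assignments of table 11 match the standard code; the practical difference for CDS translation is the larger set of accepted start codons, which is why the first codon is handled separately.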

Please check the final report here.

If you use the content of this repository, please cite the following study:
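
To give an idea of how CD-HIT clusters could be turned into a locus_tag dictionary, including the rule that a cluster with a single NCBI locus_tag is discarded, here is a minimal Python sketch that parses the `.clstr` text format written by CD-HIT. It is not the code used in this repository. The helper names, the example locus tags, and the rule that a tag containing `_RS` comes from NCBI RefSeq are assumptions made for illustration. As a simplification, the sketch keys each entry by the CD-HIT representative sequence (the member marked with `*`).

```python
import re

# Member lines look like: "0<TAB>297aa, >SOME_NAME... *"
# or "1<TAB>297aa, >OTHER_NAME... at 100.00%".
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(text):
    """Parse CD-HIT .clstr text into lists of members, representative first."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append([])
            continue
        match = MEMBER_RE.search(line)
        if match and clusters:
            member = match.group(1)
            if line.endswith("*"):
                clusters[-1].insert(0, member)  # representative goes first
            else:
                clusters[-1].append(member)
    return clusters


def is_ncbi(tag):
    """Assumption: RefSeq locus tags can be recognized by the '_RS' infix."""
    return "_RS" in tag


def build_dictionary(clusters):
    """Map each representative to its alternative locus tags."""
    dictionary = {}
    for members in clusters:
        # Discard clusters made of a single NCBI-only locus tag.
        if len(members) == 1 and is_ncbi(members[0]):
            continue
        dictionary[members[0]] = members[1:]
    return dictionary


# Made-up cluster file content, for illustration only.
example = (
    ">Cluster 0\n"
    "0\t297aa, >VNG_RS00005... at 100.00%\n"
    "1\t297aa, >VNG_0001H... *\n"
    ">Cluster 1\n"
    "0\t120aa, >VNG_RS99999... *\n"
    ">Cluster 2\n"
    "0\t88aa, >VNG_0002G... *\n"
)
locus_dictionary = build_dictionary(parse_clstr(example))
```

Note that CD-HIT truncates sequence names in the `.clstr` file by default, so long FASTA headers may need a longer description length setting for the locus tags to survive intact.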

Lorenzetti, A.P.R., Kusebauch, U., Zaramela, L.S., Wu, W-J., de Almeida, J.P.P., Turkarslan, S., de Lomana, A.L.G., Gomes-Filho, J.V., Vêncio, R.Z.N., Moritz, R.L., Koide, T., and Baliga, N.S. (2023). A Genome-Scale Atlas Reveals Complex Interplay of Transcription and Translation in an Archaeon. mSystems, e00816-22.
