This project forked from medema-group/bigslice

A highly scalable, user-interactive tool for the large scale analysis of Biosynthetic Gene Clusters data

License: GNU Affero General Public License v3.0

BiG-SLiCE

Biosynthetic Gene clusters - Super Linear Clustering Engine

Quick start

  1. Make sure you have HMMER (version 3.2b1 or later) installed.
  2. Install BiG-SLiCE using pip:
  • from PyPI (stable)
user@local:~$ pip install bigslice
  • from source (bleeding edge)
user@local:~$ git clone git@github.com:medema-group/bigslice.git
user@local:~$ pip install ./bigslice/
  3. Fetch the latest HMM models (~470 MB gzipped):
user@local:~$ download_bigslice_hmmdb
  4. Check your installation:
user@local:~$ bigslice --version .
  5. Run BiG-SLiCE clustering analysis (see wiki: Input folder on how to prepare the input folder):
user@local:~$ bigslice -i <input_folder> <output_folder>
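
Before running the steps above, it can help to confirm that the required executables are visible on your PATH. The following is a minimal sketch, not part of BiG-SLiCE itself. The helper name `missing_tools` is hypothetical, and the choice of `hmmscan` as a representative HMMER binary is an assumption.

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumption: "hmmscan" is used here as a representative HMMER binary.
    # The other two names are the commands shown in the Quick start steps.
    required = ["hmmscan", "bigslice", "download_bigslice_hmmdb"]
    missing = missing_tools(required)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required executables were found.")
```

This only checks that the commands can be located. It does not verify the HMMER version or that the HMM models have been downloaded.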

For a "minimal" test run, you can use the example input folder that we provided.

!Important! Please read this note before taking results from BiG-SLiCE for your analysis.

Querying antiSMASH BGCs

Using the --query mode, you can perform a blazing-fast query of a putative BGC against the pre-processed set of Gene Cluster Family (GCF) models that BiG-SLiCE outputs (for example, you can use our pre-processed result on ~1.2M microbial BGCs from the NCBI database -- a 17GB zipped file download). You will get a ranked list of GCFs and BGCs similar to the BGC in question, which will help in determining the function and/or novelty of said BGC. To perform a GCF query, simply use:

user@local:~$ bigslice --query <antismash_output_folder> --n_ranks <int> <output_folder>

This performs a query analysis on the latest clustering result contained in the output folder (see wiki: Program parameters for more advanced options). The top n_ranks matching GCFs are returned along with their similarity measurements. You can then view the query results using the user-interactive output (see below).
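
If you want to launch queries from a script (for example, to loop over many antiSMASH result folders), the command can be assembled programmatically. This is a minimal sketch that uses only the flags shown in the query command (`--query` and `--n_ranks`). The function name `build_query_command` and the example paths are hypothetical.

```python
import shlex


def build_query_command(antismash_dir, output_dir, n_ranks):
    """Assemble the argument list for a `bigslice --query` run."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be a positive integer")
    return [
        "bigslice",
        "--query", str(antismash_dir),
        "--n_ranks", str(n_ranks),
        str(output_dir),
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_query_command("antismash_out/my_bgc", "bigslice_out", 10)
    print(shlex.join(cmd))
    # To actually run the query, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list (rather than a single string) avoids shell-quoting problems when folder names contain spaces.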

Custom GenBank input

To perform GCF analyses on BGCs not covered by antiSMASH/MIBiG (e.g., BGCs from tools like ClusterFinder and DeepBGC, or BGCs with manually refined cluster borders), you can use the converter script that we provided. It takes a (genome) GenBank file along with a comma-separated descriptor file listing every BGC to be generated (please see the example input files provided in the script's folder).

User Interactive output

BiG-SLiCE's output folder contains both the processed input data (in the form of an SQLite3 database file) and some scripts that power a mini web-app to visualize that data. To run this visualization engine, follow these steps:

  1. Fulfill the web-app's package requirements:
user@local:~$ pip install -r <output_folder>/requirements.txt
  2. Run the flask server:
user@local:~$ bash <output_folder>/start_server.sh <port(optional)>
  3. Open an internet browser, then go to the URL described by the previous step:
  • e.g. * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
  • then go to http://0.0.0.0:5000 in your browser

Programmatic Access and Postprocessing

To access BiG-SLiCE's preprocessed data, (advanced) users need to be able to run SQL(ite) queries. Although the learning curve may be steeper than with conventional tabular output files, the SQL database provides an easy-to-use yet very powerful way to wrangle the data once you are familiar with it. Please refer to our publication manuscript to get an idea of what can be done with the output data. Additionally, you can download and reuse the Jupyter notebooks that we wrote to perform all analyses and generate figures for the manuscript.
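
As a first step toward writing your own queries, you can inspect which tables the SQLite3 file contains and how many rows each holds. The sketch below uses only Python's standard `sqlite3` module and makes no assumptions about BiG-SLiCE's schema. The function name `table_row_counts` is hypothetical, and the database path is passed in as an argument because its exact location inside the output folder is not stated here.

```python
import os
import sqlite3
import sys


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in an SQLite file."""
    # sqlite3.connect() would silently create a missing file, so check first.
    if not os.path.isfile(db_path):
        raise FileNotFoundError(db_path)
    con = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from the database itself; quote them as identifiers.
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        for table, count in table_row_counts(sys.argv[1]).items():
            print(f"{table}\t{count}")
    else:
        print("usage: python inspect_db.py <path/to/database_file>")
```

Once you know the table names, you can explore individual tables with ordinary `SELECT` statements through the same connection API.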

What kind of software is this, anyway?

bgc_gcf_illustration

Bacteria and fungi produce a vast array of bioactive compounds in nature, which can be useful for us as antibiotics (see this list), antivirals (see this list) and anticancer drugs (see Salinisporamide). To optimize and retain the production of those complex chemical agents, microbes organize the responsible genes into genomic 'clumps' colloquially termed "Biosynthetic Gene Clusters (BGCs)" (above picture, left panel). Using bioinformatics tools such as antiSMASH, we can now take a genome sequence, identify its BGCs, and predict the secondary metabolites that the organism may produce (see this example analysis for the S. coelicolor genome). Furthermore, by doing a large-scale comparative analysis of homologous BGCs sharing similar domain architectures (we call them "Gene Cluster Families (GCFs)"), we can practically chart an atlas of biosynthetic diversity among all sequenced microbes (above picture, right panel).

figure_1

To enable such a large-scale analysis, BiG-SLiCE was specifically designed with scalability and speed as the #1 priority (Figure 1A), as opposed to our previous tool, BiG-SCAPE, which was able to sensitively capture the slightest difference in both domain architecture and sequence similarity between pairs of BGCs (see our paper for the details). As a result, BiG-SLiCE can reliably take input data of more than 1.2 million BGCs and process it in less than a week of runtime on a 36-core machine with 128GB RAM (Figure 1B), while keeping enough sensitivity to delineate the essential biosynthetic 'signals' among the input BGCs (Figure 1C). Moreover, to facilitate exploration and investigation of the analysis results, BiG-SLiCE also produces an interactive, easy-to-use output visualization that can be run with minimal software / hardware requirements.

This software was initially developed and is currently maintained by Satria Kautsar (twitter: @satriaphd) as part of a fully funded PhD project granted to Dr. Marnix Medema (website: marnixmedema.nl, twitter: @marnixmedema) by the Graduate School of Experimental Plant Sciences, NL. Contributions and feedback are very welcome. Feel free to drop us an e-mail if you have any questions regarding or related to BiG-SLiCE. In the future, we aim to make BiG-SLiCE a comprehensive platform for all sorts of downstream large-scale BGC analyses, taking advantage of its portable and powerful SQLite3-based data storage combined with the flexible flask-based web app architecture as the foundation.

Find our software useful? Please cite!

Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154. https://doi.org/10.1093/gigascience/giaa154

Contributors

friederikebiermann, louwersj, mdehollander, pseudopooja, satriaphd, waltersom
