AptaMat

Purpose

AptaMat is a simple script that measures differences between DNA or RNA secondary structures. The method compares the matrices representing the two secondary structures under analysis, which are comparable to dot plots. The dot-bracket notation of each structure is converted into a half binary matrix whose width equals the structure's length. Each matrix cell (i,j) is set to '1' if the nucleotide at position i is paired with the nucleotide at position j, and to '0' otherwise.
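The conversion described above can be sketched as follows. This is a minimal illustration and not the code of AptaMat.py: the function name `dotbracket_to_matrix` is hypothetical, positions are 0-based, and only plain `(`, `)` and `.` symbols are handled.

```python
def dotbracket_to_matrix(structure):
    """Return the upper-triangular binary pairing matrix of a dot-bracket string."""
    n = len(structure)
    matrix = [[0] * n for _ in range(n)]
    open_positions = []  # stack of indices of '(' not yet closed
    for j, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {j}")
            i = open_positions.pop()
            matrix[i][j] = 1  # i < j, so only the upper half is filled
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {j}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return matrix


matrix = dotbracket_to_matrix("((...))")
for row in matrix:
    print("".join(str(cell) for cell in row))
```

The stack makes a single pass over the input, so each closing bracket is matched with the most recent unclosed opening bracket.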

The difference between the matrices is calculated by applying the Manhattan distance from each point in the template matrix to all the points in the compared matrix. The calculation is then repeated from the compared matrix to the template matrix so that all differences are captured. Both results are summed together with the number of gaps encountered, and the total is divided by the sum of all the points in both matrices.
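The following sketch illustrates one reading of this calculation. It assumes that each point contributes its distance to the nearest point of the other matrix, which the description above does not state explicitly. Under that assumption it reproduces the three values shown in the `-structures` examples of this README, but it has not been checked against the script itself. All function names are hypothetical, and the number of gaps is passed in by the caller.

```python
def base_pairs(structure):
    """List the (i, j) coordinates of paired positions in a plain dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            pairs.append((stack.pop(), j))
    return pairs


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def nearest_sum(points, others):
    """Sum, over `points`, of the distance to the closest point in `others` (assumption)."""
    return sum(min(manhattan(p, q) for q in others) for p in points)


def aptamat_like_distance(template, compared, gaps=0):
    a, b = base_pairs(template), base_pairs(compared)
    if not a or not b:
        raise ValueError("this sketch needs at least one base pair in each structure")
    # Both directions, plus the gap count, divided by the number of points in both matrices.
    return (nearest_sum(a, b) + nearest_sum(b, a) + gaps) / (len(a) + len(b))


print(aptamat_like_distance("(((...)))", "((.....))"))  # 0.4
print(aptamat_like_distance("(((...)))", ".(.....)."))  # 1.0
print(aptamat_like_distance("(((...)))", "(.......)"))  # 1.5
```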

AptaMat can handle extended dot-bracket notation: every additional bracket type is converted into coordinates in the matrix.
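A possible way to read extended notation is to keep one stack per bracket type, as in the sketch below. The symbol set ('([{<' plus uppercase letters) is the one mentioned in the issues section of this page; pairing uppercase letters with their lowercase counterparts is an assumption borrowed from common extended dot-bracket conventions, and `extended_base_pairs` is a hypothetical name.

```python
import string

# Assumption: each opening symbol has exactly one closing partner, and an
# uppercase letter is closed by the same letter in lowercase.
OPENING = "([{<" + string.ascii_uppercase
CLOSING = ")]}>" + string.ascii_lowercase
PARTNER = dict(zip(CLOSING, OPENING))


def extended_base_pairs(structure):
    """Return sorted (i, j) pairs, matching each bracket type independently."""
    stacks = {symbol: [] for symbol in OPENING}
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(j)
        elif symbol in PARTNER:
            stack = stacks[PARTNER[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {j}")
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {j}")
    if any(stacks.values()):
        raise ValueError("unclosed opening symbol")
    return sorted(pairs)


# Notation taken from the G-quadruplex example in the issues section.
print(extended_base_pairs("([..[)...(]..])"))
# [(0, 5), (1, 13), (4, 10), (9, 14)]
```

Each resulting pair would then become one '1' cell of the matrix, exactly as for plain parentheses.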

AptaMat can also compare structures of different lengths. However, we recommend working with structures of the same length. The algorithm handles gaps by treating each gap as an additional, penalized unpaired nucleotide.

Dependencies

AptaMat is written for Python 3.8+.

Two Python modules are needed: NumPy and SciPy.

These can be installed by typing either of the following in the command prompt:

./setup

or

pip install numpy
pip install scipy

Use of Anaconda is highly recommended.

Usage

AptaMat is a flexible Python script which can take several arguments:

  • -structures followed by secondary structures written in dot-bracket format

  • -weights (Optional) followed by optional weight values between 0 and 1, one per input structure

  • -files followed by the path to one or more formatted files, each containing one or several secondary structures in dot-bracket format

  • -ensemble (Optional) indicates that the input secondary structures are part of an ensemble

  • -method selects the spatial distance method used by AptaMat: cityblock (default) or euclidean

    usage: AptaMat.py [-h] [-v] [-structures STRUCTURES [STRUCTURES ...]] [-weights WEIGHTS [WEIGHTS ...]] [-files FILES [FILES ...]] [-ensemble] [-method [{cityblock,euclidean}]]
    

The structures and files arguments are independent functions in the script and cannot be used at the same time.
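For illustration, a command line with the same options could be declared with `argparse` as sketched below. This is not the parser used by AptaMat.py: the helper names `build_parser` and `parse` are hypothetical, the `-v` option is omitted, and the consistency checks are assumptions derived from the argument descriptions in this section.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="AptaMat.py")
    # -structures and -files cannot be given together.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("-structures", nargs="+", help="dot-bracket strings, template first")
    source.add_argument("-files", nargs="+", help="formatted input files")
    parser.add_argument("-weights", nargs="+", type=float, help="one weight in [0, 1] per structure")
    parser.add_argument("-ensemble", action="store_true", help="treat the input as an ensemble")
    parser.add_argument("-method", nargs="?", choices=["cityblock", "euclidean"],
                        const="cityblock", default="cityblock")
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.weights is not None:
        if args.files:
            parser.error("-weights cannot be combined with -files")
        if args.structures is None or len(args.weights) != len(args.structures):
            parser.error("-weights needs exactly one value per structure")
        if any(not 0 <= w <= 1 for w in args.weights):
            parser.error("weights must lie between 0 and 1")
    return args


args = parse(["-structures", "(((...)))", "((.....))", "-weights", "0", "1", "-ensemble"])
print(args.structures, args.weights, args.ensemble, args.method)
```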

The structures argument must be an array of secondary structures formatted as strings. The first input structure is the template structure for the comparison; the following inputs are the compared structures. There is no limit on the number of input structures. Quotes are required.

  usage: AptaMat.py -structures STRUCTURES [STRUCTURES ...]

The optional weights argument must be an array of floats in the range 0 to 1, with the same size as the input structures array. This argument is not compatible with files, because the script expects the weights to be given in the input file.

  usage: AptaMat.py -structures STRUCTURES [STRUCTURES ...] -weights WEIGHTS [WEIGHTS ...]

The files argument must be a formatted file. Multiple files can be parsed. The first structure encountered during parsing is used as the template structure; the others are the compared structures.

  usage: AptaMat.py -files FILES [FILES ...]

The input must be a text file containing at least the secondary structures. It may also include additional information such as Title, Sequence, Structure index and Weight. If several files are provided, the function parses them one by one and always takes the first structure encountered as the template structure. Files must be formatted as follows:

  >5HRU
  TCGATTGGATTGTGCCGGAAGTGCTGGCTCGA
  --Template--
  ((((.........(((((.....)))))))))
  [ weight ]
  --Compared--
  .........(((.(((((.....))))).)))
  [ weight ]
  ..........((.((((.......)))).)).
  [ weight ]
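A file in this layout could be read along the lines of the sketch below. It is only an approximation of the format shown above, not the parser shipped with AptaMat: `parse_structure_file` is a hypothetical name, lines framed by `--` are treated as plain section markers, weights are expected as a number in square brackets, and structures using letters as brackets are not recognised because they cannot be told apart from the sequence line here.

```python
import re

WEIGHT = re.compile(r"^\[\s*(\d+(?:\.\d+)?)\s*\]$")  # e.g. "[ 0.6 ]"
MARKER = re.compile(r"^--.*--$")                      # e.g. "--Template--"
STRUCTURE = re.compile(r"^[.()\[\]{}<>]+$")           # dots and brackets only
SEQUENCE = re.compile(r"^[A-Za-z]+$")


def parse_structure_file(text):
    """Return the title, sequence, template entry and compared entries of one file."""
    title, sequence, entries = None, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or MARKER.match(line):
            continue
        weight = WEIGHT.match(line)
        if line.startswith(">"):
            title = line[1:].strip()
        elif weight:
            if not entries:
                raise ValueError("weight line found before any structure")
            entries[-1]["weight"] = float(weight.group(1))  # attach to the last structure
        elif STRUCTURE.match(line):
            entries.append({"structure": line, "weight": None})
        elif SEQUENCE.match(line):
            sequence = line
        else:
            raise ValueError(f"unrecognised line: {line!r}")
    if not entries:
        raise ValueError("no structure found")
    # The first structure encountered is the template, the others are compared to it.
    return {"title": title, "sequence": sequence,
            "template": entries[0], "compared": entries[1:]}


SAMPLE = """>5HRU
TCGATTGGATTGTGCCGGAAGTGCTGGCTCGA
--Template--
((((.........(((((.....)))))))))
--Compared1--
.........(((.(((((.....))))).)))
[ 0.6 ]
--Compared2--
..........((.((((.......)))).)).
[ 0.4 ]
"""

record = parse_structure_file(SAMPLE)
print(record["title"], len(record["compared"]))
```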

ensemble is an optional argument that calculates a single AptaMat distance value for an ensemble of structures instead of calculating pairwise distances.

  usage: AptaMat.py -structures STRUCTURES [STRUCTURES ...] -weights WEIGHTS [WEIGHTS ...] -ensemble
      or
  usage: AptaMat.py -files FILES [FILES ...] -ensemble

Examples

structures function

First, a simple example with 2 structures:

  $ AptaMat.py -structures "(((...)))" "((.....))"
   (((...)))
   ((.....))
  > AptaMat : 0.4

Then, it is possible to input several structures:

  $ AptaMat.py -structures "(((...)))" "((.....))" ".(.....)." "(.......)"
  structure0 - structure1
   (((...)))
   ((.....))
  > AptaMat : 0.4

  structure0 - structure2
   (((...)))
   .(.....).
  > AptaMat : 1

  structure0 - structure3
   (((...)))
   (.......)
  > AptaMat : 1.5

files function

Taking the above file example:

  $ AptaMat.py -files example.fa
  Template - Compared1
   ((((.........(((((.....)))))))))
   .........(((.(((((.....))))).)))
  > AptaMat:
    1.588235294117647

  Template - Compared2
   ((((.........(((((.....)))))))))
   ..........((.((((.......)))).)).
  > AptaMat:
    1.6666666666666667

ensemble with input structures and weights

The four dot-bracket structures used with the -structures argument can be completed with -weights and -ensemble:

  $ AptaMat.py -structures "(((...)))" "((.....))" ".(.....)." "(.......)" -weights 0 0.5 0.3 0.2 -ensemble

  > AptaMat of structure set 
    0.8
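The value above is consistent with a weighted sum of the pairwise distances to the template, as the sketch below shows. This reading is inferred from the numbers in the examples and is an assumption rather than a description of the script's internals; `ensemble_distance` is a hypothetical name.

```python
def ensemble_distance(distances, weights):
    """Weighted sum of pairwise distances to the template (assumed ensemble score)."""
    if len(distances) != len(weights):
        raise ValueError("one weight is needed per structure")
    if any(not 0 <= w <= 1 for w in weights):
        raise ValueError("weights must lie between 0 and 1")
    return sum(w * d for w, d in zip(weights, distances))


# Pairwise values from the -structures example; the template against itself gives 0.
pairwise = [0.0, 0.4, 1.0, 1.5]
weights = [0.0, 0.5, 0.3, 0.2]
print(round(ensemble_distance(pairwise, weights), 6))  # 0.8
```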

ensemble in file

This time, we consider the above file as an ensemble and complete the structure information with weights:

  >5HRU
  TCGATTGGATTGTGCCGGAAGTGCTGGCTCGA
  --Template--
  ((((.........(((((.....)))))))))
  --Compared1--
  .........(((.(((((.....))))).)))
  [ 0.6 ]
  --Compared2--
  ..........((.((((.......)))).)).
  [ 0.4 ]

Here is the result:

  $ AptaMat.py -files example.fa

  > AptaMat of structure set 
    3.2549019607843137

Note

Since AptaMat does not include automatic structure alignment, the choice of the alignment software is up to the user.

The observations in our paper were made using the Manhattan distance. The choice of a cutoff may be guided by the topic studied and by the distance method chosen (Euclidean or Manhattan).
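To see why a cutoff does not transfer from one method to the other, the short sketch below compares the two metrics on a single pair of matrix points. The function names mirror the `-method` choices but are defined here only for illustration; `math.dist` is available from Python 3.8.

```python
import math


def cityblock(p, q):
    """Manhattan distance between two matrix coordinates."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def euclidean(p, q):
    """Straight-line distance between two matrix coordinates."""
    return math.dist(p, q)


# A base pair shifted by one position on each strand: (2, 6) versus (1, 7).
p, q = (2, 6), (1, 7)
print(cityblock(p, q))  # 2
print(euclidean(p, q))  # 1.4142135623730951
```

Because the Euclidean value is never larger than the Manhattan value for the same pair of points, a cutoff tuned for one method is not directly valid for the other.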

For the moment, no feature checks whether a given base pair can actually exist according to the literature. Be careful about the input sequence and its associated base pairing.

Citation

If you use AptaMat in your research, please support us by citing: Thomas Binet, Bérangère Avalle, Miraine Dávila Felipe, Irene Maffucci, AptaMat: a matrix-based algorithm to compare single-stranded oligonucleotides secondary structures, Bioinformatics, Volume 39, Issue 1, January 2023, btac752, https://doi.org/10.1093/bioinformatics/btac752

aptamat's Issues

Allow comparison with unfolded secondary structures

Users may want to perform quantitative analysis and assign a distance to unfolded oligonucleotides compared against folded ones, for example in a pipeline. Several solutions can be considered:

  • Give a default distance value to unfolded vs folded structures (worst solution).
  • Set the distance equal to the maximum number of base pairs observable: len(structure)//2. Several issues could arise from this:
    • How should this be handled together with enhancement #7? Take the largest? The shortest?
    • It would give an abnormally high distance value that remains constant even when different foldings are compared to the same unfolded structure. Since ranking is our main advantage over other algorithms, failing to rank here is not acceptable.
  • For each point in the matrix showing folding, assign the Manhattan distance to the farthest theoretical point in the structure, plus 1. This gives a large distance between the two structures whatever their size, and the + 1 prevents a tie with an actually folded structure that has a point at the same coordinate as the farthest theoretical point. Moreover, different scores are obtained when different foldings are compared to the same unfolded structure.

G-quadruplex/pseudoknot comprehension

Add features with G-quadruplex and pseudoknot comprehension.
These kinds of secondary structures require extended dot-bracket notation. https://www.tbi.univie.ac.at/RNA/ViennaRNA/doc/html/rna_structure_notations.html

The '([{<' & string.ascii_uppercase symbols are already included, but some doubt remains about the comparison accuracy because no tests have been run on this kind of secondary structure.

  • Run some tests on G-quadruplexes and pseudoknots and conclude on the reliability of the comparison. /!\ The complexity comes from the G-quadruplex structures: the tetrad can form base pairs in many different ways, and some secondary structure notations can be similar. Here is an example of two notations with the same interacting guanines:
    GGTTGGTGTGGTTGG
    ([..[)...(]..])
    ((..)(...)(..))
  • #5

Different length support and optimal alignment

Allow alignment of structures of different lengths. This would surely need an optimal structure alignment that makes the AptaMat distance the lowest for a shared motif.
Maybe the missing bases should be considered in the score calculation.

Is the algorithm time-consuming?

Considering the expected structure size (less than 100 nucleotides), the calculation runs quite fast. However, in theory the calculation can take time for larger structures, with a complexity around log(n^2).
Improvements can be considered, as this time complexity is linked to the double pass over the dot-bracket input.

  • Think about the possibility of improving this bracket search.
  • Study the .ct notation for ssNA secondary structure (see in ".ct notation" enhancement)
  • #6
  • Test the algorithm with this new feature
