Giter Site home page Giter Site logo

openrdp's People

Contributors

artpoon avatar brj1 avatar dpannguyen avatar gopigugan avatar kwade4 avatar sandeepthokala avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

openrdp's Issues

3Seq: no such file or directory

This exception is currently being thrown on master branch (commit 3a59bc1ae09c7fed4af4dd077c02d5abe9e097ad):

art@Wernstrom OpenRDP % python3 -i -m openrdp tests/test_neisseria.fasta test.csv -cfg tests/test_cfg.ini -all
Starting 3Seq Analysis
[Errno 2] No such file or directory: '/Users/art/git/OpenRDP/test_neisseria.fasta.3s.rec'
Finished 3Seq Analysis

Incorporate 3Seq method

Source code available at https://gitlab.com/lamhm/3seq/-/tree/master.

License:

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.
In short: you are free to share and make derivative works of the work under the conditions that you appropriately attribute it, you use the material only for non-commercial purposes, and that you distribute it only under a license compatible with this one.

Other metrics to determine which sequence is recombinant

From Darren:

OpenRDP uses only the vanilla phylpro methods to identify which of the
sequences is the recombinant in a triplet of sequences that yield a
recombinant signal. Is that right (again could not see any code for doing
that the last time I looked)? RDP uses vanilla phylpro together with 17
other metrics (including tree-distance based variations of phylpro) fed
through either a crude human-weighted decision tree or a logistic
regression weighted machine-learned formula to make this decision (this
latter thing was only added recently and may not be in the version you have
on hand).

Since this may be a new feature of RDP, we can prioritize implementing the other two improvements first (e.g., #54 )

Reconciling recombination events

From Darren:

OpenRDP does not seem to make an attempt to reconcile recombination
events that yielded the detected recombinant sequences with the recombinant
signals it detects. From what I understand, with OpenRDP if two sequences
in a datasets are both descended from an ancestral recombinant they will
not be identified as sharing evidence of the same ancestral
recombination event. Is that right? I couldn't see any code in openRDP for
reconciling/mapping the multitudes of detectable recombination signals
to individual actual recombination events.

First step is to locate this functionality in the RDP code that was provided

Incorporate geneconv method

  • Write commands to config file
  • Wrapper for geneconv
  • Format output for downstream analysis

Note: geneconv is freely distributed for academic use (https://www.math.wustl.edu/~sawyer/geneconv/)

The program GENECONV is free for academic use, but commercial rights are reserved.
The program may be freely distributed for academic use, as long as it is not altered or renamed.

Implement Bootscan

From the RDP5 manual:

  1. A window of set size is moved along the alignment a specified
    number of nucleotides at a time
  2. Bootstrap replicates of each window are constructed and pair-wise
    distances are calculated that can either themselves be used for a
    pair-wise distance BOOTSCAN or they can be used in a UPGMA
    or neighbour joining tree BOOTSCAN.
  3. At each window position the relative grouping (based either on
    pair-wise distances or tree positions) of every possible sequence
    triplet in the alignment is determined over all bootstrap replicates.
    Nucleotide sequence distances, and trees are all produced using
    recoded versions (in dna.dll) of the PHYLIP components
    DNADIST, and NEIGHBOR.
  4. Following completion of the last window in the scan, stored
    bootstrap data on pair-wise sequence relationships in every
    possible sequence triplet over all windows, is scanned for
    alterations in relative bootstrap support for sequence pairs. High
    degrees of bootstrap support alternating between two different
    sequence pairs are indicative of potential recombination
    events.
  5. Either binomial or chi-squared values can be calculated
    for identified regions.

Model Selection

I trained a Random Forest Classifier on the data (202 breakpoints) using an 80-20 test-train split and performed 5-fold cross validation.

So far, the best performing RF model has a cross validation accuracy of about 0.65. The accuracy for the test set is a little better than a guess (0.56), but the training accuracy is very high (approx 0.98).

It seems like the model is overfitting, so I will look into adding more data and see if the training and testing sets are balanced. I will also

Refactor to process triplets one at a time

Currently, each detection method receives a list containing all possible sequence triplets and runs for each triplet in the list. After all methods have executed, the possible recombination events must be consolidated into a singe output file. With this approach, the program must loop over all events for all the detection methods.

Refactoring the detection methods to receive a sequence triplet instead of a list of all triplets would eliminate the need to loop over the output of each method.

Licenses for binaries

3Seq and GENECONV are the only methods that are included as binaries.

3Seq is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License. Due to the "ShareAlike" part, we would have to distribute OpenRDP under the same license or a compatible license.

GENECONV does not specify a license, but instead the project site says "The program GENECONV is free for academic use, but commercial rights are reserved. The program may be freely distributed for academic use, as long as it is not altered or renamed."

Refining breakpoints with HMM

From Darren:

OpenRDP uses the breakpoint locations determined by the different
detection methods to identify breakpoint sites but RDP uses a sequence
triplet-based HMM method to refine the positions of breakpoint sites - also
didn't see any code for that in OpenRDP the last time I checked.

Incorporate CHIMAERA method

CHIMAERA is very similar to MaxChi, with only a few modifications. In each triplet, 2 sequences are considered as "parental" and 1 sequence is considered "recombinant".

  1. Monomorphic sites and sites at which neither of the 2 "parental" sequences match the selected "recombinant" are discarded.
  2. Each of the three sequences is encoded as a bit-string where 0 represents a match to one of the "parental" sequences and 0 represents a match to the other "parental" sequence.
  3. Calculate a 2x2 chi squared value in each half of the sliding window.

Threads and Memory options

Hi There

I am using OpenRDP pipeline on Linux for a reasonably large dataset of around 1200 sequences and noticed that all server resources (all threads, memory+ swap) are being utilised by OpenRDP after 3Seq analysis is completed. I could not find any option that allows me to limit the resources being used by the pipeline eg. the number of threads or maximum memory..

It would be great if control options for resources were provided at the top level of the pipeline such that it limits the resources for all tools.

Thanks
Sej

Process and validate command line arguments

The main pre-processing steps (common to all sequences) include:

  • Checking sequences are the same length
  • Compressing sequences
  • Categorizing sequences based on number of gaps and variants
  • Calculating initial pairwise hamming distances

In the original source code, the sequences are divided into chunks of 4 nucleotides and then converted to integers.

Upon installation of the program, the pairwise hamming distance is calculated for all 625 possible combinations of 5 characters (ATGC-) and these values are written to a file. During execution of the program, these values are stored in a lookup table, queried, and used to calculate the hamming distance.

Implement RDP Method

  1. Get every possible combination of 3 sequences (A, B, C)
  2. Construct UPGMA dendrogram from full alignment to identify the reference sequence.
  3. Remove uninformative sites (sites that are identical in all 3 sequences, or are different in all 3 sequences)
  4. Move window along shortened sequences, calculate pairwise identity for all possible pairs of sequences at every position.
  5. Identify regions where % identity is higher for A-C or B-C than A-B
  6. Calculate p-value using approximation of binomial distribution and apply Bonferroni correction
  7. When all sequence triplets have been examined, identify number of unique events
  8. Identify recombinant and parental sequences.

Choose a license

As noted in #21, 3Seq is distributed under Creative Commons BY-NC-SA 4.0, which prohibits commercial usage. Unfortunately this is not compatible with most, if not all, open source licenses for releasing software. We can either:

  1. Release OpenRDP under CC BY-NC-SA 4.0
  2. Come up with a custom license that also restricts commercial use (Google around for examples?)
  3. Remove 3Seq binaries from repository and have the user install it themselves (thereby agreeing to their license) in addition to GENECONV (see #21), and use whatever license we choose for our own code.

Strange gdk error messages in interactive mode

In [1]: from openrdp.main import openrdp
Unable to init server: Could not connect: Connection refused
Unable to init server: Could not connect: Connection refused

(ipython3:73710): Gdk-CRITICAL **: 21:15:01.446: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed

Handle variable length windows

Currently, OpenRDP only handles fixed-length windows. RDP allows users to specify the proportion of variable sites that should be included in the window.

Since each method records the indices of variable sites (poly_sites attribute), this should be straightforward to implement.

UnboundLocalError in both master and dev branch

commit 333eba7

(venv) art@Wernstrom OpenRDP % openrdp -c tests/test_cfg.ini tests/test_neisseria.fasta
Loading configuration from tests/test_cfg.ini
Starting 3Seq Analysis
Finished 3Seq Analysis
Starting GENECONV Analysis
Finished GENECONV Analysis
Setting up bootscan analysis...
Starting Scanning Phase of Bootscan/Recscan
Finished Scanning Phase of Bootscan/Recscan
Setting up maxchi analysis...
Setting up siscan analysis...
Setting up chimaera analysis...
Setting up rdp analysis...
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/art/git/OpenRDP/venv/lib/python3.10/site-packages/OpenRDP-0.1.0-py3.10.egg/openrdp/bootscan.py", line 176, in execute
    print("Scanning triplet {} / {}".format(i, self.total_triplet_combinations))
UnboundLocalError: local variable 'i' referenced before assignment
"""

Update unit tests

I've probably broken most of them in refactoring the interface and package structure

General Sequence of Events

  1. Input validation (sequence lengths, properly formatted headers)

  2. Unmask / Mask (masked sequences omitted in initial analysis step)

  3. Initial analysis (command22_click)

    • Run BOOTSCAN, GENECOV, MaxChi, Chimaera, Siscan, 3SEQ, RDP
    • Identify number of unique recombination of events
    • Output: PDist, P-vals, #Hits
  4. Secondary analysis

    • Handle multiple identifications of same recombination event (if any) by selecting sequence with strongest signal, erasing a portion of the sequence, and reanalyzing, until no signals are detected.
    • Include masked sequences in analyses
    • Output: BPNum

User interface improvements

I'm not crazy about the current command-line invocation:

python3 -m openrdp ./openrdp/tests/test_neisseria.fasta ./test.csv -cfg ./openrdp/tests/test_cfg.ini -all
  • I would much prefer to enable users to simply call openrdp as an executable script in /usr/local/bin or /home/USER/.local/bin
  • There should be a default configuration, so the user does not have to use the -cfg flag.
  • Calling the script without setting -cfg proceeds to run the analysis until it throws an exception:
Setting Up Siscan Analysis
Scanning triplet 1 / 4
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/art/git/OpenRDP/openrdp/__main__.py", line 4, in <module>
    main()
  File "/home/art/git/OpenRDP/openrdp/main.py", line 173, in main
    openrdp(infile, outfile, cfg, run_geneconv, run_three_seq, run_rdp,
  File "/home/art/git/OpenRDP/openrdp/main.py", line 146, in openrdp
    scanner.run_scans(aln)
  File "/home/art/git/OpenRDP/openrdp/scripts/run_scans.py", line 139, in run_scans
    maxchi.execute(triplet)
  File "/home/art/git/OpenRDP/openrdp/scripts/maxchi.py", line 147, in execute
    win_size = triplet.get_win_size(0, self.win_size, self.fixed_win_size, self.num_var_sites,
AttributeError: 'MaxChi' object has no attribute 'win_size'

-cfg is listed as an optional argument in the help text, so it should behave as such!

Make Triplet a generator?

Triplet is an object class that is being initialized with the full lists of sequences and labels, and then three integers to index into these lists. Then we are looping over all possible triplets and constructing a Triplet object at each iteration. This seems really inefficient to me. I would rather call Triplet once and call its next method to retrieve the next object.

Incorrect coordinates due to indexing errors

In order identify breakpoints, I use a list of p-values that corresponds to each window position and only p-values below the threshold are added to this list. I later use this indices of the p-values to output the coordinates. But since the length of the list does not correspond to the number of polymorphic sites, the coordinates are incorrect.

No GENECONV binary executable for macOS

To get the source code to compile in macOS Catalina 10.15.7, I had to make a couple of modifications:

  1. Added void presskey(void); declaration to line 36 of geneconv.c
  2. Replaced all instances of getline with ggetline to avoid namespace collision with macOS SDK.

Look into performance of individual recombination methods

Before running its analyses RDP estimates the amount of memory needed, and gives an estimate for execution time. The documentation for RDP mentions that the user should "take note when you are told that the analysis you are proposing will take a number of days or weeks" (Section 3.1.4).

I wonder if this is a product of the methods themselves or RDP's pre-processing steps.

AttributeError: module '__main__' has no attribute '__spec__'

Encountered while working on #38:

art@Wernstrom OpenRDP % openrdp tests/test_neisseria.fasta test -c tests/test_cfg.ini 
Starting 3Seq Analysis
Finished 3Seq Analysis
Starting GENECONV Analysis
Finished GENECONV Analysis
Setting up geneconv analysis...
Setting up bootscan analysis...
Starting Scanning Phase of Bootscan/Recscan
Traceback (most recent call last):
  File "/usr/local/bin/openrdp", line 4, in <module>
    __import__('pkg_resources').run_script('OpenRDP==0.0.1', 'openrdp')
  File "/usr/local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 672, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1472, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.10/site-packages/OpenRDP-0.0.1-py3.10.egg/EGG-INFO/scripts/openrdp", line 30, in <module>
    results = openrdp.openrdp(args.infile, args.outfile, cfg=args.cfg,
  File "/usr/local/lib/python3.10/site-packages/OpenRDP-0.0.1-py3.10.egg/openrdp/__init__.py", line 293, in openrdp
    results = scanner.run_scans(aln)
  File "/usr/local/lib/python3.10/site-packages/OpenRDP-0.0.1-py3.10.egg/openrdp/__init__.py", line 174, in run_scans
    tmethods.append(a['method'](alignment, settings=settings, quiet=self.quiet))
  File "/usr/local/lib/python3.10/site-packages/OpenRDP-0.0.1-py3.10.egg/openrdp/bootscan.py", line 33, in __init__
    self.dists = self.do_scanning_phase(alignment)
  File "/usr/local/lib/python3.10/site-packages/OpenRDP-0.0.1-py3.10.egg/openrdp/bootscan.py", line 133, in do_scanning_phase
    with multiprocessing.Pool() as p:
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 183, in get_preparation_data
    main_mod_name = getattr(main_module.__spec__, "name", None)
AttributeError: module '__main__' has no attribute '__spec__'

I could not reproduce this problem from the master branch or from the dev branch, so it is probably something to do with changes I've made to the user interface. I suspect it has something to do with my merging the main scripts into a single executable file under /bin.

Discordance between OpenRDP and RDP v5.5

  • OpenRDP is currently reporting more recombination breakpoints than RDP v5.5 on Windows for small test cases, for all tests
  • this is particularly strange for tests where we are simply passing the data to a third-party binary executable, and then parsing the outputs - suggests that there is some problem with sequence pre-processing?
  • @GopiGugan can you please distribute RDP results either here or on Slack so we can work on this together?

Gather and clean data

  • Use HIV circulating recombinant forms
  • Look at output available from RDP's UI and command line

Random seed not properly set for Bootscan and Siscan

Multiple runs of Bootscan using the same test data yields different results, despite the fact that a random seed is set. This is also the case for Siscan.

The random seed is correctly parsed from the config file, so I suspect it is not being set before all calls to random.randint().

Packaging Python script

Hello PoonLab. Do you have any plans to package OpenRDP to make it pip or conda installable? I'm trying to incorporate recombination detection into my workflow within our organization's computer cluster. RDP4/5 being available as Window's binaries make this impossible as we're running on Linux servers. Your re-implementation of RDP looks extremely promising and capable of being integrated into a high-throughput command line workflow. Do you plan to actively maintain OpenRDP?

Output formatting

Results should be displayed on the command line interface in a nice tabular format, and in addition written to a file in a standard format like CSV (JSON would be useful for more structured data, but that would be less accessible for users).
Rows can correspond to breakpoints, grouped by method, so columns can be:

  • method
  • breakpoint location
  • parental fragment on left
  • parental fragment on right
  • p-value (confidence)
  • recombinant sequence identifier (in some cases) - this could mean multiple rows with the same breakpoint from identity by descent

Typo in default config

Hi There,

I got the following error when running the OpenRDP pipeline which is due to a minor typo in the default config file.
The script fails as it is looking for the key Siscan but the default_config.ini has parameters specified under [SisScan] in line 72.

Unable to init server: Could not connect: Connection refused
Unable to init server: Could not connect: Connection refused

(__main__.py:31383): Gdk-CRITICAL **: 14:36:25.273: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
Starting 3Seq Analysis
Finished 3Seq Analysis
Starting GENECONV Analysis
Finished GENECONV Analysis
Setting Up MaxChi Analysis
Setting Up Chimaera Analysis
Setting Up RDP Analysis
Setting Up Bootscan Analysis
Starting Scanning Phase of Bootscan/Recscan
Traceback (most recent call last):
  File "/usr/lib/python3.6/configparser.py", line 846, in items
    d.update(self._sections[section])
KeyError: 'Siscan'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/OpenRDP-0.0.1-py3.6.egg/openrdp/__main__.py", line 4, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/OpenRDP-0.0.1-py3.6.egg/openrdp/main.py", line 174, in main
    run_siscan, run_maxchi, run_chimaera, run_bootscan, args.quiet)
  File "/usr/local/lib/python3.6/dist-packages/OpenRDP-0.0.1-py3.6.egg/openrdp/main.py", line 146, in openrdp
    scanner.run_scans(aln)
  File "/usr/local/lib/python3.6/dist-packages/OpenRDP-0.0.1-py3.6.egg/openrdp/scripts/run_scans.py", line 125, in run_scans
    siscan = Siscan(alignment, settings=dict(config.items('Siscan')))
  File "/usr/lib/python3.6/configparser.py", line 849, in items
    raise NoSectionError(section)
configparser.NoSectionError: No section: 'Siscan'
Finished Scanning Phase of Bootscan/Recscan
Setting Up Siscan Analysis

Failed in installation

Hi PoonLab, I have met some problems when installing OpenRDP.
when I run sudo python3 setup.py install, it gotUserWarning: Unknown distribution option: 'define_macros' with many warnings and KeyError: '/mp-plvphf17' I am hoping you can help me solve this problem and really looking forward to using your program.

Incorporate MaxChi Method

Given an alignment MAXCHI examines sequence pairs and seeks to identify recombination breakpoints by looking for significant differences in the proportions of variable and non-variable polymorphic alignment positions in adjacent regions of the sequences.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.