poonlab / openrdp
An open-source re-implementation of the RDP4 recombination detection program
License: GNU General Public License v3.0
This exception is currently being thrown on the master branch (commit 3a59bc1ae09c7fed4af4dd077c02d5abe9e097ad):
art@Wernstrom OpenRDP % python3 -i -m openrdp tests/test_neisseria.fasta test.csv -cfg tests/test_cfg.ini -all
Starting 3Seq Analysis
[Errno 2] No such file or directory: '/Users/art/git/OpenRDP/test_neisseria.fasta.3s.rec'
Finished 3Seq Analysis
Source code available at https://gitlab.com/lamhm/3seq/-/tree/master.
License:
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.
In short: you are free to share and make derivative works of the work under the conditions that you appropriately attribute it, you use the material only for non-commercial purposes, and that you distribute it only under a license compatible with this one.
From Darren:
OpenRDP uses only the vanilla phylpro methods to identify which of the
sequences is the recombinant in a triplet of sequences that yield a
recombinant signal. Is that right (again could not see any code for doing
that the last time I looked)? RDP uses vanilla phylpro together with 17
other metrics (including tree-distance based variations of phylpro) fed
through either a crude human-weighted decision tree or a logistic
regression weighted machine-learned formula to make this decision (this
latter thing was only added recently and may not be in the version you have
on hand).
Since this may be a new feature of RDP, we can prioritize implementing the other two improvements first (e.g., #54 )
From Darren:
OpenRDP does not seem to make an attempt to reconcile recombination
events that yielded the detected recombinant sequences with the recombinant
signals it detects. From what I understand, with OpenRDP if two sequences
in a dataset are both descended from an ancestral recombinant they will
not be identified as sharing evidence of the same ancestral
recombination event. Is that right? I couldn't see any code in openRDP for
reconciling/mapping the multitudes of detectable recombination signals
to individual actual recombination events.
First step is to locate this functionality in the RDP code that was provided
geneconv
Note: geneconv is freely distributed for academic use (https://www.math.wustl.edu/~sawyer/geneconv/)
The program GENECONV is free for academic use, but commercial rights are reserved.
The program may be freely distributed for academic use, as long as it is not altered or renamed.
Create JSON config file to handle arguments for each detection method
See #5
RDP5 removes sequences that are identical, but currently OpenRDP does not.
From the RDP5 manual:
I don't like calling os.remove() on files with wildcards - there's the off chance that we delete user files by mistake. For cases where I need a script to write data out to the filesystem, I strongly prefer to use the tempfile module.
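A minimal sketch of the tempfile approach, assuming the external program can be told where to write its output. The helper names and the fake_3seq stand-in are hypothetical and are not part of OpenRDP:

```python
import os
import tempfile


def run_in_scratch_dir(write_outputs):
    """Call write_outputs(dirname) inside a private temporary directory.

    Everything written there is removed when the context exits, so no
    wildcard-based os.remove() call is needed.
    """
    with tempfile.TemporaryDirectory(prefix="openrdp_") as tmpdir:
        result = write_outputs(tmpdir)
    return result, tmpdir


def fake_3seq(tmpdir):
    # Hypothetical stand-in for an external program that writes a result file.
    path = os.path.join(tmpdir, "example.3s.rec")
    with open(path, "w") as handle:
        handle.write("P\tQ\tC\n")
    with open(path) as handle:
        return handle.read()


contents, scratch = run_in_scratch_dir(fake_3seq)
```

The results are read back before the directory is cleaned up, so only the parsed contents leave the scratch area.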
I trained a Random Forest Classifier on the data (202 breakpoints) using an 80-20 test-train split and performed 5-fold cross validation.
So far, the best performing RF model has a cross validation accuracy of about 0.65. The accuracy for the test set is a little better than a guess (0.56), but the training accuracy is very high (approx 0.98).
It seems like the model is overfitting, so I will look into adding more data and see if the training and testing sets are balanced. I will also
Currently, each detection method receives a list containing all possible sequence triplets and runs for each triplet in the list. After all methods have executed, the possible recombination events must be consolidated into a single output file. With this approach, the program must loop over all events for all the detection methods.
Refactoring the detection methods to receive a sequence triplet instead of a list of all triplets would eliminate the need to loop over the output of each method.
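One way the proposed refactoring could look, as a rough sketch. It assumes each detection method can be expressed as a callable that takes a single triplet, which may not hold for methods that need a whole-alignment scanning phase. The method names and return values below are placeholders:

```python
from itertools import combinations


def scan_by_triplet(names, methods):
    """Yield one consolidated record per triplet, covering every method."""
    for trio in combinations(names, 3):
        yield {
            "triplet": trio,
            "results": {name: func(trio) for name, func in methods.items()},
        }


# Hypothetical stand-ins for real detection methods.
methods = {
    "maxchi": lambda trio: [],
    "rdp": lambda trio: [("breakpoint", 10, 50)],
}
records = list(scan_by_triplet(["A", "B", "C", "D"], methods))
```

Because each record already groups the output of all methods for one triplet, no second pass over per-method result lists is required.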
3Seq and GENECONV are the only methods that are included as binaries.
3Seq is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License. Due to the "ShareAlike" part, we would have to distribute OpenRDP under the same license or a compatible license.
GENECONV does not specify a license, but instead the project site says "The program GENECONV is free for academic use, but commercial rights are reserved. The program may be freely distributed for academic use, as long as it is not altered or renamed."
From Darren:
OpenRDP uses the breakpoint locations determined by the different
detection methods to identify breakpoint sites but RDP uses a sequence
triplet-based HMM method to refine the positions of breakpoint sites - also
didn't see any code for that in OpenRDP the last time I checked.
CHIMAERA is very similar to MaxChi, with only a few modifications. In each triplet, 2 sequences are considered as "parental" and 1 sequence is considered "recombinant".
Hi There
I am using the OpenRDP pipeline on Linux for a reasonably large dataset of around 1200 sequences and noticed that all server resources (all threads, memory + swap) are being utilised by OpenRDP after the 3Seq analysis is completed. I could not find any option that allows me to limit the resources being used by the pipeline, e.g. the number of threads or maximum memory.
It would be great if control options for resources were provided at the top level of the pipeline such that it limits the resources for all tools.
Thanks
Sej
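A sketch of how a thread limit might be passed down, assuming the parallel steps use multiprocessing.Pool (as the Bootscan tracebacks elsewhere on this page suggest). The function names and the threads parameter are hypothetical, and this only bounds the number of worker processes, not memory:

```python
import multiprocessing
import os


def resolve_workers(requested=None):
    """Clamp a user-requested worker count to the range [1, number of CPUs]."""
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(int(requested), available))


def run_parallel(func, items, threads=None):
    """Map func over items using a bounded number of worker processes."""
    with multiprocessing.Pool(processes=resolve_workers(threads)) as pool:
        return pool.map(func, items)
```

A top-level command-line option could feed its value into threads, so that every method that spawns a pool respects the same limit.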
The main pre-processing steps (common to all sequences) include:
In the original source code, the sequences are divided into chunks of 4 nucleotides and then converted to integers.
Upon installation of the program, the pairwise hamming distance is calculated for all 625 possible combinations of 5 characters (ATGC-) and these values are written to a file. During execution of the program, these values are stored in a lookup table, queried, and used to calculate the hamming distance.
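The following is one possible reading of that scheme, not a transcription of the original source: 4-character chunks over the 5-letter alphabet give 5**4 = 625 distinct chunk codes, and distances between codes are precomputed. The table is built in memory here rather than written to a file, and all names are illustrative:

```python
from itertools import product

ALPHABET = "ATGC-"
CHUNK = 4

# Every 4-character chunk over the 5-letter alphabet: 5**4 == 625 codes.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=CHUNK)]
CODE = {kmer: i for i, kmer in enumerate(KMERS)}

# Precomputed Hamming distance for every pair of chunk codes.
TABLE = [[sum(a != b for a, b in zip(x, y)) for y in KMERS] for x in KMERS]


def encode(seq):
    """Convert a sequence to a list of integer chunk codes, padding with '-'."""
    seq = seq.upper()
    seq += "-" * (-len(seq) % CHUNK)
    return [CODE[seq[i:i + CHUNK]] for i in range(0, len(seq), CHUNK)]


def hamming(codes1, codes2):
    """Sum table lookups over aligned chunks of two equal-length sequences."""
    return sum(TABLE[a][b] for a, b in zip(codes1, codes2))
```

For example, hamming(encode("ATGCAT"), encode("ATGGAA")) looks up two chunk pairs instead of comparing six characters.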
As noted in #21, 3Seq is distributed under Creative Commons BY-NC-SA 4.0, which prohibits commercial usage. Unfortunately this is not compatible with most, if not all, open source licenses for releasing software. We can either:
In [1]: from openrdp.main import openrdp
Unable to init server: Could not connect: Connection refused
Unable to init server: Could not connect: Connection refused
(ipython3:73710): Gdk-CRITICAL **: 21:15:01.446: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
Some breakpoints are reported twice or even three times. I think this issue may be related to how the chi-squared values are processed to identify "peaks".
Currently, OpenRDP only handles fixed-length windows. RDP allows users to specify the proportion of variable sites that should be included in the window.
Since each method records the indices of variable sites (poly_sites attribute), this should be straightforward to implement.
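A rough sketch of a window defined by a proportion of variable sites, assuming poly_sites is a sorted list of alignment coordinates. The function names, the minimum size and the step parameter are illustrative assumptions:

```python
def variable_window_size(poly_sites, proportion, minimum=2):
    """Number of variable sites per window for the given proportion."""
    size = int(round(proportion * len(poly_sites)))
    return max(minimum, min(size, len(poly_sites)))


def windows(poly_sites, size, step=1):
    """Yield (start, end) alignment coordinates of windows spanning `size` variable sites."""
    for i in range(0, len(poly_sites) - size + 1, step):
        yield poly_sites[i], poly_sites[i + size - 1]


poly_sites = [3, 10, 11, 25, 40, 41, 60, 77, 80, 95]
size = variable_window_size(poly_sites, 0.3)
coords = list(windows(poly_sites, size))
```

Each window covers the same number of variable sites, so its width in alignment coordinates varies with the local density of polymorphism.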
commit 333eba7
(venv) art@Wernstrom OpenRDP % openrdp -c tests/test_cfg.ini tests/test_neisseria.fasta
Loading configuration from tests/test_cfg.ini
Starting 3Seq Analysis
Finished 3Seq Analysis
Starting GENECONV Analysis
Finished GENECONV Analysis
Setting up bootscan analysis...
Starting Scanning Phase of Bootscan/Recscan
Finished Scanning Phase of Bootscan/Recscan
Setting up maxchi analysis...
Setting up siscan analysis...
Setting up chimaera analysis...
Setting up rdp analysis...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/Users/art/git/OpenRDP/venv/lib/python3.10/site-packages/OpenRDP-0.1.0-py3.10.egg/openrdp/bootscan.py", line 176, in execute
print("Scanning triplet {} / {}".format(i, self.total_triplet_combinations))
UnboundLocalError: local variable 'i' referenced before assignment
"""
I've probably broken most of them in refactoring the interface and package structure
Input validation (sequence lengths, properly formatted headers)
Unmask / Mask (masked sequences omitted in initial analysis step)
Initial analysis (command22_click)
Secondary analysis
I'm not crazy about the current command-line invocation:
python3 -m openrdp ./openrdp/tests/test_neisseria.fasta ./test.csv -cfg ./openrdp/tests/test_cfg.ini -all
openrdp as an executable script in /usr/local/bin or /home/USER/.local/bin
-cfg flag.
-cfg proceeds to run the analysis until it throws an exception:
Setting Up Siscan Analysis
Scanning triplet 1 / 4
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/art/git/OpenRDP/openrdp/__main__.py", line 4, in <module>
main()
File "/home/art/git/OpenRDP/openrdp/main.py", line 173, in main
openrdp(infile, outfile, cfg, run_geneconv, run_three_seq, run_rdp,
File "/home/art/git/OpenRDP/openrdp/main.py", line 146, in openrdp
scanner.run_scans(aln)
File "/home/art/git/OpenRDP/openrdp/scripts/run_scans.py", line 139, in run_scans
maxchi.execute(triplet)
File "/home/art/git/OpenRDP/openrdp/scripts/maxchi.py", line 147, in execute
win_size = triplet.get_win_size(0, self.win_size, self.fixed_win_size, self.num_var_sites,
AttributeError: 'MaxChi' object has no attribute 'win_size'
-cfg is listed as an optional argument in the help text, so it should behave as such!
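A minimal sketch of making -cfg truly optional with argparse by giving it a default. The default file name and help strings are assumptions made for illustration, and the other flags of the real interface are omitted:

```python
import argparse

# Assumed name for a bundled default configuration file.
DEFAULT_CFG = "default_config.ini"


def build_parser():
    parser = argparse.ArgumentParser(prog="openrdp")
    parser.add_argument("infile", help="input FASTA alignment")
    parser.add_argument("outfile", help="path to write results")
    parser.add_argument("-cfg", default=DEFAULT_CFG,
                        help="configuration file (default: %(default)s)")
    return parser


args_default = build_parser().parse_args(["in.fasta", "out.csv"])
args_custom = build_parser().parse_args(["in.fasta", "out.csv", "-cfg", "my.ini"])
```

With a default in place, every method always receives a complete settings section, whether or not the user passes the flag.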
Triplet is an object class that is being initialized with the full lists of sequences and labels, and then three integers to index into these lists. Then we are looping over all possible triplets and constructing a Triplet object at each iteration. This seems really inefficient to me. I would rather call Triplet once and call its next method to retrieve the next object.
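A sketch of what such an iterator could look like, using Python's iterator protocol. The class name and the shape of the returned tuples are assumptions, not the existing Triplet interface:

```python
from itertools import combinations


class TripletIterator:
    """Hold the sequences once and hand out triplets lazily via next()."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels
        self._indices = combinations(range(len(sequences)), 3)

    def __iter__(self):
        return self

    def __next__(self):
        # next() on the index generator raises StopIteration when exhausted.
        i, j, k = next(self._indices)
        return (
            (self.labels[i], self.labels[j], self.labels[k]),
            (self.sequences[i], self.sequences[j], self.sequences[k]),
        )


triplets = TripletIterator(["ACGT", "ACGA", "TCGA", "TCGT"], ["a", "b", "c", "d"])
first_labels, first_seqs = next(triplets)
remaining = list(triplets)
```

The full lists are stored a single time, and only index tuples are generated per iteration.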
Line 137 in d5cb639
This will cause aln to become a smaller list in the presence of identical sequences. However, seq_names is a list that stores the sequence labels from the input alignment, and is not modified.
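One way to keep the two lists in step is to filter them together. This is a sketch under the assumption that aln is a list of sequence strings parallel to seq_names; the function name is hypothetical:

```python
def remove_identical(aln, seq_names):
    """Drop duplicate sequences, keeping the first label seen for each one."""
    seen = {}
    for name, seq in zip(seq_names, aln):
        seen.setdefault(seq, name)  # dicts preserve insertion order
    return list(seen.keys()), list(seen.values())


aln, seq_names = remove_identical(
    ["ACGT", "ACGT", "ACGA"], ["s1", "s2", "s3"]
)
```

After the call, index i in aln and index i in seq_names refer to the same sequence again.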
In order to identify breakpoints, I use a list of p-values that corresponds to each window position, and only p-values below the threshold are added to this list. I later use the indices of the p-values to output the coordinates. But since the length of the list does not correspond to the number of polymorphic sites, the coordinates are incorrect.
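A sketch of one possible fix: keep the full list of p-values and carry the original index through the filter, so that coordinates are looked up by window position. The names and the example values are made up for illustration:

```python
def significant_windows(pvalues, window_coords, threshold=0.05):
    """Return (start, end, p) for windows below the threshold.

    pvalues[i] and window_coords[i] describe the same window, so the
    index from enumerate() is still valid after filtering.
    """
    return [
        (window_coords[i][0], window_coords[i][1], p)
        for i, p in enumerate(pvalues)
        if p < threshold
    ]


pvalues = [0.40, 0.01, 0.30, 0.002]
window_coords = [(3, 11), (10, 25), (11, 40), (25, 41)]
hits = significant_windows(pvalues, window_coords)
```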
To get the source code to compile in macOS Catalina 10.15.7, I had to make a couple of modifications:
- Added a void presskey(void); declaration to line 36 of geneconv.c
- Replaced getline with ggetline to avoid a namespace collision with the macOS SDK.

Currently we have an openrdp function that acts as a wrapper for the Scanner object class:
Line 290 in d5cb639
I would prefer to let the user directly work with Scanner, so that the configuration settings can be modified on the fly.
Before running its analyses RDP estimates the amount of memory needed, and gives an estimate for execution time. The documentation for RDP mentions that the user should "take note when you are told that the analysis you are proposing will take a number of days or weeks" (Section 3.1.4).
I wonder if this is a product of the methods themselves or RDP's pre-processing steps.
Encountered while working on #38:
art@Wernstrom OpenRDP % openrdp tests/test_neisseria.fasta test -c tests/test_cfg.ini
Starting 3Seq Analysis
Finished 3Seq Analysis
Starting GENECONV Analysis
Finished GENECONV Analysis
Setting up geneconv analysis...
Setting up bootscan analysis...
Starting Scanning Phase of Bootscan/Recscan
Traceback (most recent call last):
File "/usr/local/bin/openrdp", line 4, in <module>
__import__('pkg_resources').run_script('OpenRDP==0.0.1', 'openrdp')
File "/usr/local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 672, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1472, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.10/site-packages/OpenRDP-0.0.1-py3.10.egg/EGG-INFO/scripts/openrdp", line 30, in <module>
results = openrdp.openrdp(args.infile, args.outfile, cfg=args.cfg,
File "/usr/local/lib/python3.10/site-packages/OpenRDP-0.0.1-py3.10.egg/openrdp/__init__.py", line 293, in openrdp
results = scanner.run_scans(aln)
File "/usr/local/lib/python3.10/site-packages/OpenRDP-0.0.1-py3.10.egg/openrdp/__init__.py", line 174, in run_scans
tmethods.append(a['method'](alignment, settings=settings, quiet=self.quiet))
File "/usr/local/lib/python3.10/site-packages/OpenRDP-0.0.1-py3.10.egg/openrdp/bootscan.py", line 33, in __init__
self.dists = self.do_scanning_phase(alignment)
File "/usr/local/lib/python3.10/site-packages/OpenRDP-0.0.1-py3.10.egg/openrdp/bootscan.py", line 133, in do_scanning_phase
with multiprocessing.Pool() as p:
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 215, in __init__
self._repopulate_pool()
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 306, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 329, in _repopulate_pool_static
w.start()
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 183, in get_preparation_data
main_mod_name = getattr(main_module.__spec__, "name", None)
AttributeError: module '__main__' has no attribute '__spec__'
I could not reproduce this problem from the master branch or from the dev branch, so it is probably something to do with changes I've made to the user interface. I suspect it has something to do with my merging the main scripts into a single executable file under /bin.
Multiple runs of Bootscan using the same test data yield different results, despite the fact that a random seed is set. This is also the case for Siscan.
The random seed is correctly parsed from the config file, so I suspect it is not being set before all calls to random.randint().
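One way to make the draws reproducible regardless of call order or worker scheduling is to give each task its own random.Random instance derived from the configured seed, instead of relying on the module-level generator. This is a sketch only; the function name and the task_id parameter are assumptions:

```python
import random


def resample_columns(n_sites, seed, task_id=0):
    """Draw column indices with replacement from a task-specific generator."""
    # String seeds are hashed deterministically by random.Random.
    rng = random.Random("{}:{}".format(seed, task_id))
    return [rng.randint(0, n_sites - 1) for _ in range(n_sites)]


run1 = resample_columns(20, seed=42, task_id=3)
run2 = resample_columns(20, seed=42, task_id=3)
```

Since no global state is involved, the same seed and task always produce the same sample, even when tasks run in separate processes.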
Hello PoonLab. Do you have any plans to package OpenRDP to make it pip or conda installable? I'm trying to incorporate recombination detection into my workflow within our organization's computer cluster. RDP4/5 being available only as Windows binaries makes this impossible, as we're running on Linux servers. Your re-implementation of RDP looks extremely promising and capable of being integrated into a high-throughput command line workflow. Do you plan to actively maintain OpenRDP?
Results should be displayed on the command line interface in a nice tabular format, and in addition written to a file in a standard format like CSV (JSON would be useful for more structured data, but that would be less accessible for users).
Rows can correspond to breakpoints, grouped by method, so columns can be:
Hi There,
I got the following error when running the OpenRDP pipeline, which is due to a minor typo in the default config file.
The script fails as it is looking for the key Siscan but the default_config.ini has parameters specified under [SisScan] in line 72.
Unable to init server: Could not connect: Connection refused
Unable to init server: Could not connect: Connection refused
(__main__.py:31383): Gdk-CRITICAL **: 14:36:25.273: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
Starting 3Seq Analysis
Finished 3Seq Analysis
Starting GENECONV Analysis
Finished GENECONV Analysis
Setting Up MaxChi Analysis
Setting Up Chimaera Analysis
Setting Up RDP Analysis
Setting Up Bootscan Analysis
Starting Scanning Phase of Bootscan/Recscan
Traceback (most recent call last):
File "/usr/lib/python3.6/configparser.py", line 846, in items
d.update(self._sections[section])
KeyError: 'Siscan'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/OpenRDP-0.0.1-py3.6.egg/openrdp/__main__.py", line 4, in <module>
main()
File "/usr/local/lib/python3.6/dist-packages/OpenRDP-0.0.1-py3.6.egg/openrdp/main.py", line 174, in main
run_siscan, run_maxchi, run_chimaera, run_bootscan, args.quiet)
File "/usr/local/lib/python3.6/dist-packages/OpenRDP-0.0.1-py3.6.egg/openrdp/main.py", line 146, in openrdp
scanner.run_scans(aln)
File "/usr/local/lib/python3.6/dist-packages/OpenRDP-0.0.1-py3.6.egg/openrdp/scripts/run_scans.py", line 125, in run_scans
siscan = Siscan(alignment, settings=dict(config.items('Siscan')))
File "/usr/lib/python3.6/configparser.py", line 849, in items
raise NoSectionError(section)
configparser.NoSectionError: No section: 'Siscan'
Finished Scanning Phase of Bootscan/Recscan
Setting Up Siscan Analysis
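Besides renaming the section in the config file, a small lookup helper could make this kind of mismatch easier to diagnose. The following is a sketch, not the project's code; the helper name and the win_size key are used only for illustration:

```python
import configparser


def get_settings(config, section):
    """Return a section as a dict, or fail with a message listing what exists."""
    if not config.has_section(section):
        raise KeyError(
            "No section {!r} in config; found: {}".format(
                section, ", ".join(config.sections())
            )
        )
    return dict(config.items(section))


config = configparser.ConfigParser()
config.read_string("[SisScan]\nwin_size = 200\n")
settings = get_settings(config, "SisScan")

try:
    get_settings(config, "Siscan")
    message = ""
except KeyError as err:
    message = str(err)
```

The error message then names the sections that were actually found, which points directly at the misspelled header.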
Hi PoonLab, I have run into some problems when installing OpenRDP.
When I run sudo python3 setup.py install, I get UserWarning: Unknown distribution option: 'define_macros' with many warnings and KeyError: '/mp-plvphf17'
I am hoping you can help me solve this problem and really looking forward to using your program.
hold on this until we have resolved #48 (make results concordant with RDP)
Given an alignment, MAXCHI examines sequence pairs and seeks to identify recombination breakpoints by looking for significant differences in the proportions of variable and non-variable polymorphic alignment positions in adjacent regions of the sequences.
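As a simplified illustration of that idea (not the OpenRDP or RDP implementation), the sketch below slides a split window along the polymorphic sites and computes a 2x2 chi-squared statistic comparing agreement between the two sequences on the left and right halves. The input representation and function names are assumptions:

```python
def chi2_2x2(a, b, c, d):
    """Chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def maxchi_profile(matches, half_window):
    """matches[i] is True when the pair agrees at the i-th polymorphic site.

    Returns (centre, chi2) for every position where both halves fit.
    """
    profile = []
    for centre in range(half_window, len(matches) - half_window + 1):
        left = matches[centre - half_window:centre]
        right = matches[centre:centre + half_window]
        a, b = sum(left), half_window - sum(left)
        c, d = sum(right), half_window - sum(right)
        profile.append((centre, chi2_2x2(a, b, c, d)))
    return profile


# Toy pair: identical over the first five polymorphic sites, different after.
matches = [True] * 5 + [False] * 5
profile = maxchi_profile(matches, half_window=3)
peak = max(profile, key=lambda item: item[1])
```

The peak of the profile marks the candidate breakpoint; assessing its significance would require a further step that is not shown here.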
Windows, Linux, Mac binaries available here http://mateo.fourment.free.fr/software.html (see #1)