statbiophys / olga
Compute generation probabilities of CDR3 amino acid and nucleotide sequences
License: GNU General Public License v3.0
Hi, I have an issue when running the olga-generate_sequences command via the terminal. I'm using the following exact command: olga-generate_sequences -d ',' --VDJ_model_folder olga_input -o cdr3_seqs.csv -n 50
My input anchor CSV files are comma-separated, and all the model files, including the V and J gene anchor files, are located in the olga_input directory.
The program seems to run (it shows the "starting sequence generation" text in the terminal), but it never generates a sequence, so the output file remains empty (even after more than 30 minutes of run time).
When pressing Ctrl-C I get the following:
^CTraceback (most recent call last):
File "/Users/woutvanhelvoirt/Documents/Virtualenvs/python2.7.15_1/bin/olga-generate_sequences", line 11, in <module>
sys.exit(main())
File "/Users/woutvanhelvoirt/Documents/Virtualenvs/python2.7.15_1/lib/python2.7/site-packages/olga/generate_sequences.py", line 286, in main
ntseq, aaseq, V_in, J_in = seq_gen.gen_rnd_prod_CDR3(conserved_J_residues)
File "/Users/woutvanhelvoirt/Documents/Virtualenvs/python2.7.15_1/lib/python2.7/site-packages/olga/sequence_generation.py", line 211, in gen_rnd_prod_CDR3
recomb_events = self.choose_random_recomb_events()
File "/Users/woutvanhelvoirt/Documents/Virtualenvs/python2.7.15_1/lib/python2.7/site-packages/olga/sequence_generation.py", line 277, in choose_random_recomb_events
recomb_events['delDl'] = delDldelDr_choice/self.num_delDr_poss
KeyboardInterrupt
I'm using Python version 2.7.15.
Cheers, Wout
Whenever TRBV21 is used for v_masking it is unrecognized. Any suggestion for a fix?
./compute_pgen.py --humanTRB CASSTGQANYGYTF --v_mask TRBV21
output:
These V genes/alleles are not recognized: TRBV21
No recognized V genes/alleles in the provided V_mask. Continuing without conditioning on V usage.
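One way to check which V gene names a model will accept is to look at the gene column of its V anchor file (V_gene_CDR3_anchors.csv in the model folder). The sketch below assumes the file has a header row and the gene name in the first column; the helper names and the example rows are made up for illustration, and this is not how OLGA itself resolves a V mask.

```python
import csv
import io


def genes_in_anchor_file(handle, delimiter=","):
    """Return the set of gene/allele names found in an anchor CSV.

    Assumes a header row and the gene name in the first column.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row
    return {row[0].strip() for row in reader if row}


def unmatched_masks(masks, known_genes):
    """Return the masks that match neither an allele nor a gene name."""
    missing = []
    for mask in masks:
        # Accept an exact allele name, or a gene name followed by '*allele'.
        hit = any(g == mask or g.startswith(mask + "*") for g in known_genes)
        if not hit:
            missing.append(mask)
    return missing


# Made-up rows for illustration; real use would open the model's anchor file.
example = io.StringIO("gene,anchor_index\nTRBV6-6*01,270\nTRBV20-1*01,279\n")
known = genes_in_anchor_file(example)
print(unmatched_masks(["TRBV6-6", "TRBV21"], known))
```

If a gene is absent from the anchor file, a mask on it has nothing to match, which would be consistent with the "not recognized" message above.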
I hope I'm not missing something, but it would be nice if you could condition the generate-sequences command. Something like:
olga-generate_sequences --humanTRB -n5 --v_mask TRBV6-6
An improvement to the user experience would be to put both separate CLI scripts under a single parser. I have done such a thing in my own project as well, and it's very easy to implement using Python's ArgumentParser class and its ability to create subparser objects:
import argparse
main_parser = argparse.ArgumentParser(prog="olga", description="")
# Add some options to the main parser here.
# Like options that can be used for all underlying programs (like model type or
# separator).
subparsers = main_parser.add_subparsers(help="", dest="subparser_name")
# Add a sub parser to the main parser here.
# Example:
subparsers.add_parser("GenerateSeqs", help="", description="")
# Add some options to the 'GenerateSeqs' sub parser here.
# Options only used for this specific parser.
# Parse the command line arguments. The sub parser name specifies which option
# to execute.
parsed_arguments = main_parser.parse_args()
if parsed_arguments.subparser_name == "GenerateSeqs":
    # Generate some sequences here.
    # Note that the 'parsed_arguments' object only contains option arguments based
    # on the given 'subparser_name' destination. This removes a lot of clutter.
    pass
Hope this helps!
Cheers, Wout
I forked OLGA, ran 2to3 and touched up some integer-division issues:
It runs fine, but I haven't done anything to verify that it gives correct output. The integer-division stuff took about 20-30 minutes to fix up, so if you want to make OLGA work in 3.7 that's about the amount of time you could save by folding in my fork. :)
If you wanted to leave the codebase in 2.x but make it so a run of 2to3 would make it run in 3.7 (the reason I didn't just make a pull request), the manual changes I made are all in generation_probability.py. I made a function called "div3int" and called it wherever I made a change. All that would run fine in Python 2.x.
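For context on the kind of change involved: in Python 2, / between two ints floors the result, while in Python 3 it returns a float, so index arithmetic written with / changes meaning after 2to3. A minimal sketch of a division that behaves the same in both versions follows. floor_div is a hypothetical stand-in, not the div3int function from the fork (its implementation is not shown here), and the row-major index layout is an assumption.

```python
def floor_div(a, b):
    """Integer division that behaves the same in Python 2 and Python 3."""
    return a // b


def split_joint_index(joint_index, num_right):
    """Unpack a flattened (left, right) index pair, assuming row-major layout."""
    left = floor_div(joint_index, num_right)
    right = joint_index % num_right
    return left, right


print(split_joint_index(17, 5))
```

With / instead of //, the first element would be a float under Python 3 and could no longer be used as a list index.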
Hello,
I would like to know whether your default model (human_T_beta), which is used to calculate Pgen, comes from healthy individuals.
Best wishes
Hello,
When I used full vgene information for v_masking I received a lot of NA values without any error/warning messages about incompatibility. When I ran again with just vfamily annotation for v_masking it resulted in all Pgen values as expected. Is there any reason that you can think of that may be causing this? Love this tool thank you for your help.
Minor issue, as most machines will have this installed already, but the pip installation script doesn't include numpy, so it has to be installed separately.
Hello,
I started working on a project where we got TCRB CDR3 sequences of mice that were modified to have down-regulated or completely silenced TdT activity. I wanted to use your package to generate more of these sequences to compare results from actual data to randomly generated sequences.
Unfortunately, it seems it is not possible to generate random productive CDR3 beta sequences without TdT-inserted nucleotides with your current package. I edited the source code locally and added a function to the SequenceGenerationVDJ class that creates a sequence the same way you do, only leaving out the inserted nucleotides; it works pretty well and didn't take long to implement. I thought I might as well share it here so you might consider adding this as a feature sometime, to give more control over the sequences produced. Of course, you could also do the same for SequenceGenerationVJ.
While I did this, I noticed that the gen_rnd_prod_CDR3 function for SequenceGenerationVDJ does not return the choice of D segment used. I assume this is a bug, as I'd also like to know which D segment was used.
Cheers,
Gabe van den Hoeven
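To illustrate the idea of recombination without N insertions, here is a toy sketch. It is not the function described above and not OLGA's implementation: it ignores palindromic nucleotides and the conserved anchor residues, the segment sequences and deletion counts are made up, and the productivity check is reduced to "in frame and no stop codon".

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def assemble_without_insertions(v_seq, d_seq, j_seq, del_v, del_d5, del_d3, del_j):
    """Join trimmed V, D and J segments with no nucleotides inserted between them."""
    v_part = v_seq[: len(v_seq) - del_v]          # trim the 3' end of V
    d_part = d_seq[del_d5: len(d_seq) - del_d3]   # trim both ends of D
    j_part = j_seq[del_j:]                        # trim the 5' end of J
    return v_part + d_part + j_part


def in_frame_without_stop(nt_seq):
    """Simplified productivity check: length divisible by 3 and no stop codon."""
    if len(nt_seq) % 3 != 0:
        return False
    codons = (nt_seq[i:i + 3] for i in range(0, len(nt_seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


# Made-up segment ends and deletion counts, for illustration only.
nt_seq = assemble_without_insertions(
    "TGTGCCAGC", "GGGACAGGG", "AACTATGGCTACACCTTC",
    del_v=2, del_d5=3, del_d3=3, del_j=1,
)
print(nt_seq, in_frame_without_stop(nt_seq))
```

Reporting the D segment alongside the sequence would then only require returning the chosen D index or name together with the assembled string.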
CAVEGYNTDKLIL 0.000000e+00
CAVERSTGGFKTIF 0.000000e+00
CAVRPLTSGSRLTF 0.000000e+00
The above sequences have given me zero values from pgen humanTRA.
Does this suggest that my sequencing information is wrong?
How do we interpret zero probabilities if we find these in nature (assuming the output of MiXCR is correct)?
Hello! I'm updating our olga installation as I saw that BCR light chain models have been added. I'm managing an Anaconda environment for olga and ran python setup.py install without issue. However, I run into this error when I try to execute compute_pgen for the IGK model:
(olga) wyatt.mcdonnell@bespin1 [~] [09:55] > olga-compute_pgen --humanIGK
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Check pathing... cannot find the model folder: /mnt/home/wyatt.mcdonnell/anaconda3/envs/olga/lib/python2.7/site-packages/olga-1.2.1-py2.7.egg/olga/default_models/human_B_kappa)
Exiting...
Not sure why this is happening, as everything looks to be present and in the right location:
(olga) wyatt.mcdonnell@bespin1 [~] [09:57] > ls /mnt/home/wyatt.mcdonnell/anaconda3/envs/olga/lib/python2.7/site-packages/olga-1.2.1-py2.7.egg/olga/default_models/human_B_kappa
. .. J_gene_CDR3_anchors.csv model_marginals.txt model_params.txt V_gene_CDR3_anchors.csv
(olga) wyatt.mcdonnell@bespin1 [~] [09:57] > ls /mnt/home/wyatt.mcdonnell/anaconda3/envs/olga/lib/python2.7/site-packages/olga-1.2.1-py2.7.egg/olga/default_models/human_B_lambda
. .. J_gene_CDR3_anchors.csv model_marginals.txt model_params.txt V_gene_CDR3_anchors.csv
Any ideas as to why this might be happening? Thanks!
I will mostly use OLGA as an importable package, but I have one issue with doing this. I still need to specify the model params and marginals files as well as the CDR3 anchor index files. In my package I will have to write my anchor files to CSV first and pass the file locations to OLGA's internal classes. Afterwards I'll have to delete the written CSV files. Not optimal...
It would be very nice to be able to give these functions some sort of data structure (numpy arrays or even a pandas data frame). In this scenario, the command-line tool would have a utility function that first parses the data files into a table structure, which is in turn handed over to the internal classes. Much more practical when extending this code.
Cheers, Wout
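Until something like that exists, the write-then-delete step can at least be wrapped so the cleanup always happens. This is a workaround sketch using only the standard library (Python 3); the function name temporary_anchor_files, the header row, and the example rows are assumptions, and no OLGA call is shown.

```python
import contextlib
import csv
import os
import shutil
import tempfile


@contextlib.contextmanager
def temporary_anchor_files(v_rows, j_rows, delimiter=","):
    """Write anchor rows to CSV files in a temp dir and remove them on exit."""
    tmp_dir = tempfile.mkdtemp(prefix="olga_anchors_")
    try:
        paths = []
        for name, rows in (("V_gene_CDR3_anchors.csv", v_rows),
                           ("J_gene_CDR3_anchors.csv", j_rows)):
            path = os.path.join(tmp_dir, name)
            with open(path, "w", newline="") as handle:
                writer = csv.writer(handle, delimiter=delimiter)
                writer.writerow(["gene", "anchor_index"])  # assumed header
                writer.writerows(rows)
            paths.append(path)
        yield tuple(paths)
    finally:
        # Runs even if the caller's block raises.
        shutil.rmtree(tmp_dir, ignore_errors=True)


# Made-up rows for illustration only.
v_rows = [("TRBV6-6*01", 270)]
j_rows = [("TRBJ1-2*01", 19)]
with temporary_anchor_files(v_rows, j_rows) as (v_path, j_path):
    # Hand v_path and j_path to whatever loader expects file names.
    existed_inside = os.path.exists(v_path) and os.path.exists(j_path)
print(existed_inside, os.path.exists(v_path))
```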
The separator for the CSV files is selectable by the user when running OLGA through the command line, but not when importing OLGA into my own software. This means I have no control over which separator is used in the CSV files containing the CDR3 anchors for the V and J genes.
Could you add this functionality? Cheers, Wout
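As a stopgap, an anchor file with a different separator can be rewritten as comma-separated before it is handed to the library. A minimal sketch with the standard csv module; convert_delimiter is a hypothetical helper and the sample rows are made up.

```python
import csv
import io


def convert_delimiter(src, dst, src_delimiter, dst_delimiter=","):
    """Copy CSV rows from src to dst, changing only the field delimiter."""
    reader = csv.reader(src, delimiter=src_delimiter)
    writer = csv.writer(dst, delimiter=dst_delimiter, lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# In-memory example; real use would pass open file handles instead.
source = io.StringIO("gene;anchor_index\nTRBV6-6*01;270\n")
target = io.StringIO()
convert_delimiter(source, target, src_delimiter=";")
print(target.getvalue())
```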
Hi,
The generate_sequences.py script uses VDJ_model_folder and the compute_pgen.py script uses set_custom_model_VDJ to specify non-default IGoR-generated models. SONIA appears to use set_custom_model_VDJ exclusively. Might it be okay to change VDJ_model_folder to set_custom_model_VDJ in generate_sequences.py for consistency and a better user experience?
Thanks,
Zach
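A rename like this does not have to break existing command lines, since one option can carry both spellings. The sketch below uses argparse to show the idea; it is not OLGA's actual parser, and the prog name and dest are made up.

```python
import argparse

parser = argparse.ArgumentParser(prog="generate_sequences_sketch")
parser.add_argument(
    "--set_custom_model_VDJ",
    "--VDJ_model_folder",  # older spelling kept as an alias
    dest="vdj_model_folder",
    metavar="PATH",
    help="folder containing a custom VDJ model",
)

# Both spellings populate the same attribute.
new_style = parser.parse_args(["--set_custom_model_VDJ", "olga_input"])
old_style = parser.parse_args(["--VDJ_model_folder", "olga_input"])
print(new_style.vdj_model_folder == old_style.vdj_model_folder)
```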
Hello again,
While diving into the source code of the VDJ sequence generation for my research project (explained in more detail in the issue linked here: #17), I also wanted to annotate each generated sequence with the segments that were used. Since the source code refers to the segments by their index in a list of CDR3 + palindromic nt sequences, I matched the CDR3 sequences back to the gene segments on IMGT, and I found all segments except for TRBV31*01.
Looking deeper into it, I found that the sequences are derived from the model_params.txt file. The sequence in this file does not match the sequence found on IMGT. I don't know whether there is a reason for this; if so, I'd like to know. The IMGT reference I'm talking about is found here:
https://www.imgt.org/ligmdb/view.action?id=IMGT000132
To make it easier I'll include the sequences here as well:
TRBV31*01 olga:
AGACTCCAGGCACAGAGGTAGAAGCCAGAGTGGCTGAGAAGCAGCTTCTCCGTGCTTAGGATGAATTGGTCGTCCTTCGGCCTGGAAGCTGAGAGGTTCAGTTGCACCACCGACTCTACCTGGCCAACAGTAATAGAGTAGAAGAGTTGCTGGAGGGTGCCTCCTGTGGCCTGCCAGTACCAGTAGAGGTTAGGGCTTGATTTCCCCTTTATGGTACACCCCAGAGACAGTGGGCTGCCCACAGCCTTGATCTCGGCAACTGGCCATTGATGGATAGTCTGAGC
TRBV31*01 IMGT:
GCTCAGACTATCCATCAATGGCCAGTTGCCGAGATCAAGGCTGTGGGCAGCCCACTGTCTCTGGGGTGTACCATAAAGGGGAAATCAAGCCCTAACCTCTACTGGTACTGGCAGGCCACAGGAGGCACCCTCCAGCAACTCTTCTACTCTATTACTGTTGGCCAGGTAGAGTCGGTGGTGCAACTGAACCTCTCAGCTTCCAGGCCGAAGGACGACCAATTCATCCTAAGCACGGAGAAGCTGCTTCTCAGCCACTCTGGCTTCTACCTCTGTGCCTGGAGTCT
Kind regards,
Gabe van den Hoeven
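One quick check worth running on the two strings is whether one is the reverse complement of the other. The sketch below compares only the first 30 bases of the OLGA entry with the last 30 bases of the IMGT entry, both copied from the sequences above; the full-length strings would need the same comparison before drawing a conclusion.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


# Excerpts copied from the two TRBV31*01 sequences above.
olga_first_30 = "AGACTCCAGGCACAGAGGTAGAAGCCAGAG"
imgt_last_30 = "CTCTGGCTTCTACCTCTGTGCCTGGAGTCT"

print(reverse_complement(imgt_last_30) == olga_first_30)
```

If the full sequences agree in the same way, the two entries would describe the same segment read from opposite strands rather than different sequences.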