statbiophys / olga

Compute generation probabilities of CDR3 amino acid and nucleotide sequences

License: GNU General Public License v3.0

Python 100.00%

olga's Issues

Possibility for Numba integration?

Hi, would you be able to integrate Numba into the compute_aa_CDR3_pgen functionality of this package? This could speed things up by an order of magnitude, which makes all the difference for datasets with more than 1e4 sequences (e.g. from OAS).

Sequence generation issue via CLI tool

Hi, I have an issue when running the olga-generate_sequences command via the terminal. I'm using the following exact command: olga-generate_sequences -d ',' --VDJ_model_folder olga_input -o cdr3_seqs.csv -n 50.

My input anchor csv files are comma separated and all the model files including V and J gene anchor files are located in the olga_input directory.

The program seems to run (it prints the "starting sequence generation" text in the terminal), but it never generates a sequence, so the output file remains empty (even after more than 30 minutes of run time).

When pressing ctrl-c I get the following:

^CTraceback (most recent call last):
  File "/Users/woutvanhelvoirt/Documents/Virtualenvs/python2.7.15_1/bin/olga-generate_sequences", line 11, in <module>
    sys.exit(main())
  File "/Users/woutvanhelvoirt/Documents/Virtualenvs/python2.7.15_1/lib/python2.7/site-packages/olga/generate_sequences.py", line 286, in main
    ntseq, aaseq, V_in, J_in = seq_gen.gen_rnd_prod_CDR3(conserved_J_residues)
  File "/Users/woutvanhelvoirt/Documents/Virtualenvs/python2.7.15_1/lib/python2.7/site-packages/olga/sequence_generation.py", line 211, in gen_rnd_prod_CDR3
    recomb_events = self.choose_random_recomb_events()
  File "/Users/woutvanhelvoirt/Documents/Virtualenvs/python2.7.15_1/lib/python2.7/site-packages/olga/sequence_generation.py", line 277, in choose_random_recomb_events
    recomb_events['delDl'] = delDldelDr_choice/self.num_delDr_poss
KeyboardInterrupt

I'm using Python version 2.7.15.

Cheers, Wout

TRBV21

Whenever TRBV21 is used for v_masking it is unrecognized. Any suggestion for a fix?

./compute_pgen.py --humanTRB CASSTGQANYGYTF --v_mask TRBV21

output:

These V genes/alleles are not recognized: TRBV21
No recognized V genes/alleles in the provided V_mask. Continuing without conditioning on V usage.
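
One way to narrow this down is to list the gene names the model actually knows and look for close matches to the mask. The sketch below is not OLGA's own matching logic. It assumes the V anchor file is delimited text with a header row and the gene name in the first column, and the sample rows are made up purely for illustration.

```python
import csv
import io


def find_gene_candidates(anchor_csv_text, query, delimiter=","):
    """Return gene names from an anchor table that start with `query`.

    Assumes the gene name is the first column and the first row is a header.
    """
    reader = csv.reader(io.StringIO(anchor_csv_text), delimiter=delimiter)
    next(reader, None)  # skip the header row
    names = [row[0].strip() for row in reader if row]
    query = query.strip().upper()
    return [name for name in names if name.upper().startswith(query)]


# Hypothetical rows for illustration only; a real anchor file will differ.
SAMPLE = """gene,anchor_index
TRBV20-1*01,0
TRBV21-1*01,0
TRBV23-1*01,0
"""

print(find_gene_candidates(SAMPLE, "TRBV21"))
```

If the only candidates carry a suffix that the mask lacks, passing the full name to --v_mask may be worth trying.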

Specifying v_mask/j_mask for olga-generate_sequences

I hope I'm not missing something, but it would be nice if you could condition the generate-sequences command. Something like:

olga-generate_sequences --humanTRB -n5 --v_mask TRBV6-6
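
Until such an option exists, one possible workaround is rejection sampling: keep drawing sequences and discard those whose V (or J) gene is not in the mask. The sketch below is not an OLGA feature. `generate_with_gene_filter` and `fake_draw` are hypothetical names, and the only assumption about OLGA is that a draw yields a tuple of the form (ntseq, aaseq, V index, J index), as the gen_rnd_prod_CDR3 call in the traceback of the CLI issue above suggests.

```python
import random


def generate_with_gene_filter(draw, accepted_v, accepted_j=None, n=5, max_draws=100000):
    """Keep drawing until `n` sequences use an accepted V (and J) gene.

    `draw` is any zero-argument callable returning
    (ntseq, aaseq, v_index, j_index).
    """
    kept = []
    for _ in range(max_draws):
        ntseq, aaseq, v_index, j_index = draw()
        if v_index not in accepted_v:
            continue
        if accepted_j is not None and j_index not in accepted_j:
            continue
        kept.append((ntseq, aaseq, v_index, j_index))
        if len(kept) == n:
            break
    return kept


# Stand-in generator so the sketch runs without OLGA installed.
def fake_draw(rng=random.Random(0)):
    return ("TGTGCC", "CA", rng.randrange(4), rng.randrange(2))


print(generate_with_gene_filter(fake_draw, accepted_v={2}, n=3))
```

Rejection sampling gets slow when the masked genes are rarely used, which is one reason built-in conditioning would still be preferable.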

Combining the generation and Pgen OLGA command scripts

An improvement to the user experience would be to put both separate CLI scripts under a single parser. I have done this in my own project as well, and it's very easy to implement using Python's ArgumentParser class and its ability to create sub-parser objects:

import argparse

main_parser = argparse.ArgumentParser(prog="olga", description="")
# Add some options to the main parser here.
# Like options that can be used for all underlying programs (like model type or
# separator).

subparsers = main_parser.add_subparsers(help="", dest="subparser_name")
# Add a sub parser to the main parser here.

# Example:
subparsers.add_parser("GenerateSeqs", help="", description="")
# Add some options to the 'GenerateSeqs' sub parser here.
# Options only used for this specific parser.

# Parse the command line arguments. The sub parser name specifies which option
# to execute.
parsed_arguments = main_parser.parse_args()
if parsed_arguments.subparser_name == "GenerateSeqs":
    # Generate some sequences here.
    # Note that the 'parsed_arguments' object only contains option arguments based
    # on the given 'subparser_name' destination. This removes a lot of clutter.
    pass  # Placeholder so the block is valid Python; replace with the real call.

Hope this helps!

Cheers, Wout

I got OLGA running in Python 3.7

I forked OLGA, ran 2to3 and touched up some integer-division issues:

https://github.com/dhmay/OLGA

It runs fine, but I haven't done anything to verify that it gives correct output. The integer-division stuff took about 20-30 minutes to fix up, so if you want to make OLGA work in 3.7 that's about the amount of time you could save by folding in my fork. :)

If you wanted to leave the codebase in 2.x but make it so a run of 2to3 would make it run in 3.7 (the reason I didn't just make a pull request), the manual changes I made are all in generation_probability.py. I made a function called "div3int" and called it wherever I made a change. All that would run fine in Python 2.x.
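
For context on what such a helper might look like: in Python 2, `/` between two integers floors the result, while in Python 3 it returns a float, which breaks code that uses the quotient as an index. The sketch below is a guess at the idea, not the fork's code. The post names `div3int` but does not show its body, and `decode_flat_choice` is a hypothetical example of the kind of index arithmetic affected.

```python
def div3int(numerator, denominator):
    """Floor division that behaves the same under Python 2 and Python 3.

    Hypothetical body: the fork's actual `div3int` is not shown in the post.
    """
    return numerator // denominator


def decode_flat_choice(flat_choice, num_right):
    """Split a flattened (left, right) choice index into its two parts."""
    left = div3int(flat_choice, num_right)
    right = flat_choice % num_right
    return left, right


print(decode_flat_choice(7, 3))  # (2, 1)
print(7 / 3 == 7 // 3)           # False on Python 3, True on Python 2
```

Because `//` exists in both major versions, a helper like this keeps a single codebase valid before and after a 2to3 run.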

default model

hello,
I would like to know whether your default model (human_T_beta), which is used to calculate Pgen, comes from healthy individuals.
best wishes

using vgene vs vfamily

Hello,

When I used full V gene information for v_masking, I received a lot of NA values without any error or warning messages about incompatibility. When I ran it again with just V family annotation for v_masking, all Pgen values came out as expected. Is there any reason you can think of that may be causing this? Love this tool, thank you for your help.

olga PyPi does not install numpy

A minor issue, as most machines will already have it installed, but the pip installation script doesn't include numpy as a dependency, so it has to be installed separately.

Idea for enhancements SequenceGenerationVDJ and missing D segment choice for ``gen_rnd_prod_CDR3``

Hello,
I started working on a project where we got TCRB CDR3 sequences of mice that were modified to have down-regulated or completely silenced TdT activity. I wanted to use your package to generate more of these sequences to compare results from actual data to randomly generated sequences.
Unfortunately, it seems it is not possible to generate random productive CDR3 beta sequences without inserted nucleotides from TdT with your current package. I tried to edit the source code locally and added a function to the SequenceGenerationVDJ class that created a sequence in the same way that you did only leaving out the inserted nucleotides, and it works pretty well and didn't take long to implement. I thought I might as well share it here so you might think of adding this as a feature sometime to have more control over the sequences produced. Of course, you could also do the same for SequenceGenerationVJ.
While I did this, I noticed that the gen_rnd_prod_CDR3 function for SequenceGenerationVDJ did not return the choice of D segment used. I assume this is a bug as I'd also like to know which D segment was used.

Cheers,
Gabe van den Hoeven
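
To make the request concrete, here is a toy sketch of insertion-free generation that also reports the D segment. It is not the code described in the post and not OLGA's implementation. The function names, the uniform trimming and the segment strings are all placeholders, whereas a real model would draw genes and deletions from its learned distributions.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_productive(ntseq):
    """In frame and free of stop codons (a simplified productivity check)."""
    if len(ntseq) % 3 != 0:
        return False
    codons = (ntseq[i:i + 3] for i in range(0, len(ntseq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


def gen_no_insertion_cdr3(v_segs, d_segs, j_segs, max_del=3, rng=random, max_tries=10000):
    """Join trimmed V, D and J segments with no inserted nucleotides.

    Returns (ntseq, v_index, d_index, j_index) for the first productive
    draw, or None if all `max_tries` draws fail.
    """
    for _ in range(max_tries):
        v_index = rng.randrange(len(v_segs))
        d_index = rng.randrange(len(d_segs))
        j_index = rng.randrange(len(j_segs))
        v, d, j = v_segs[v_index], d_segs[d_index], j_segs[j_index]
        # Uniform trimming is a placeholder for the model's deletion profiles.
        v = v[:max(0, len(v) - rng.randint(0, max_del))]
        d = d[rng.randint(0, max_del):]
        d = d[:max(0, len(d) - rng.randint(0, max_del))]
        j = j[rng.randint(0, max_del):]
        ntseq = v + d + j  # no N nucleotides between the segments
        if is_productive(ntseq):
            return ntseq, v_index, d_index, j_index
    return None


# Made-up segment ends, for illustration only.
V_SEGS = ["TGTGCCAGCAGC"]
D_SEGS = ["GGGACAGGGGGC"]
J_SEGS = ["AACACTGAAGCTTTCTTT"]

print(gen_no_insertion_cdr3(V_SEGS, D_SEGS, J_SEGS, rng=random.Random(1)))
```

Returning the D index alongside V and J is the part that addresses the second point in the post.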

pgen value of 0

CAVEGYNTDKLIL 0.000000e+00
CAVERSTGGFKTIF 0.000000e+00
CAVRPLTSGSRLTF 0.000000e+00

The above sequences have given me zero values from pgen humanTRA.

Does this suggest that my sequencing information is wrong?

How do we interpret zero probabilities if we find these in nature (assuming the output of MiXCR is correct)?

Possible issue with either setup.py or pathing command within OLGA

Hello! I'm updating our olga installation as I saw that BCR light chain models have been added. I'm managing an Anaconda environment for olga and ran python setup.py install without issue. However, I run into this error when I try to execute compute_pgen for the IGK model:

(olga) wyatt.mcdonnell@bespin1 [~] [09:55] > olga-compute_pgen --humanIGK
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Check pathing... cannot find the model folder: /mnt/home/wyatt.mcdonnell/anaconda3/envs/olga/lib/python2.7/site-packages/olga-1.2.1-py2.7.egg/olga/default_models/human_B_kappa)
Exiting...

Not sure why this is happening, as everything looks to be present and in the right location:

(olga) wyatt.mcdonnell@bespin1 [~] [09:57] > ls /mnt/home/wyatt.mcdonnell/anaconda3/envs/olga/lib/python2.7/site-packages/olga-1.2.1-py2.7.egg/olga/default_models/human_B_kappa
.  ..  J_gene_CDR3_anchors.csv  model_marginals.txt  model_params.txt  V_gene_CDR3_anchors.csv
(olga) wyatt.mcdonnell@bespin1 [~] [09:57] > ls /mnt/home/wyatt.mcdonnell/anaconda3/envs/olga/lib/python2.7/site-packages/olga-1.2.1-py2.7.egg/olga/default_models/human_B_lambda
.  ..  J_gene_CDR3_anchors.csv  model_marginals.txt  model_params.txt  V_gene_CDR3_anchors.csv

Any ideas as to why this might be happening? Thanks!
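
A quick diagnostic for this kind of report is to check the folder and its files from Python, printing the path with `repr` so that stray characters stand out (the path in the error message above ends in a closing parenthesis, which may or may not be part of what the program looked up). The helper below is hypothetical and not part of OLGA. The file names are the ones from the directory listing in the post.

```python
import os
import tempfile

# File names taken from the directory listing in the post above.
REQUIRED_MODEL_FILES = (
    "model_params.txt",
    "model_marginals.txt",
    "V_gene_CDR3_anchors.csv",
    "J_gene_CDR3_anchors.csv",
)


def check_model_folder(folder):
    """Return a list of human-readable problems; empty means all looks fine."""
    if not os.path.isdir(folder):
        return ["not a directory: %r" % folder]
    problems = []
    for name in REQUIRED_MODEL_FILES:
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            problems.append("missing file: %s" % name)
        elif not os.access(path, os.R_OK):
            problems.append("not readable: %s" % name)
    return problems


# Demonstration on an empty temporary folder: all four files are reported.
with tempfile.TemporaryDirectory() as demo_folder:
    for problem in check_model_folder(demo_folder):
        print(problem)
```

Running the same check under the user and environment that launches olga-compute_pgen helps separate permission problems from path problems.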

advanced users - array/data frame inputs when loading a model

I will mostly use OLGA as an importable package, but I have one issue with doing so. I still need to specify the model params and marginals files as well as the CDR3 anchor index files. In my package I will have to write my anchor files to CSV first and pass the file locations to OLGA's internal classes. Afterwards I'll have to delete the written CSV files. Not optimal...

It would be very nice to be able to give these functions some sort of data structure (numpy arrays or even a pandas data frame). In this scenario, the command-line tool would have a utility function that first parses the data files into a table structure, which is in turn handed over to the internal classes. That would be much more practical when extending this code.

Cheers, Wout
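
As long as the loaders only accept file paths, the write-then-delete step can at least be made automatic with a context manager. The sketch below is a workaround, not an OLGA API. `anchors_as_temp_csv` is a hypothetical helper, and the column names and sample rows are assumptions for illustration.

```python
import contextlib
import csv
import os
import shutil
import tempfile


@contextlib.contextmanager
def anchors_as_temp_csv(rows, header=("gene", "anchor_index"), delimiter=","):
    """Write in-memory anchor rows to a temporary CSV and yield its path.

    The file and its directory are removed when the `with` block exits,
    even if loading the model raises.
    """
    tmp_dir = tempfile.mkdtemp(prefix="olga_anchors_")
    path = os.path.join(tmp_dir, "anchors.csv")
    try:
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter=delimiter)
            writer.writerow(header)
            writer.writerows(rows)
        yield path
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)


# Hypothetical rows; in practice these would come from an array or data frame.
rows = [("TRBV1*01", 267), ("TRBV2*01", 273)]
with anchors_as_temp_csv(rows) as v_anchor_path:
    # Hand `v_anchor_path` to the model-loading call here.
    print(os.path.exists(v_anchor_path))  # True
print(os.path.exists(v_anchor_path))      # False
```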

Advanced usage: separator should be variable for input files

The separator for the CSV files is selectable by the user when running OLGA through the command line, but not when importing OLGA into my own software. This means I have no control over which separator is used in the CSV file containing the CDR3 anchors for the V and J genes.
Could you add this functionality? Cheers, Wout
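
In the meantime, an anchor table can be re-serialised with whatever separator the loader expects before it is handed over. This is a minimal sketch using only the standard library. `convert_delimiter` is a hypothetical helper and the sample row is made up.

```python
import csv
import io


def convert_delimiter(text, source_delimiter, target_delimiter):
    """Re-serialise delimited text with a different separator."""
    reader = csv.reader(io.StringIO(text), delimiter=source_delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=target_delimiter, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


print(convert_delimiter("gene;anchor_index\nTRBV1*01;267\n", ";", ","))
```

Going through the csv module rather than a plain string replace keeps fields intact if they ever contain the target separator, because such fields are quoted on output.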

Inconsistent argparse arguments

Hi,

The generate_sequences.py script uses VDJ_model_folder and the compute_pgen.py uses set_custom_model_VDJ to specify non-default IGoR-generated models. SONIA appears to use set_custom_model_VDJ exclusively. Might it be okay to change VDJ_model_folder to set_custom_model_VDJ in generate_sequences.py for consistency and a better user experience?

Thanks,
Zach
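
One way to make such a rename without breaking existing command lines is to register both spellings for the same option, since argparse accepts several option strings per argument. The sketch below only illustrates that mechanism. The `dest`, `metavar` and help text are assumptions and not the actual definitions in generate_sequences.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="olga-generate_sequences")
    # Both spellings map to the same destination; the first is the preferred one.
    parser.add_argument(
        "--set_custom_model_VDJ",
        "--VDJ_model_folder",
        dest="vdj_model_folder",
        metavar="PATH",
        help="folder containing a custom VDJ generative model",
    )
    return parser


parser = build_parser()
print(parser.parse_args(["--VDJ_model_folder", "olga_input"]).vdj_model_folder)
print(parser.parse_args(["--set_custom_model_VDJ", "olga_input"]).vdj_model_folder)
```

Both calls print olga_input, so old scripts keep working while the documentation can move to the consistent name.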

Gene sequence of TRBV31*01 of the mouse_T_beta default model does not match the IMGT gene sequence

Hello again,

I was diving into the source code of the VDJ sequence generation for my research project (I explained more in issue #17), and I also wanted to annotate each generated sequence with the segments that were used. Since the source code refers to the segments by their index in a list of CDR3 + palindromic nt sequences, I matched the CDR3 sequences back to the gene segments on IMGT, and I found all segments except for TRBV31*01.
Looking deeper into it, I found the sequences were derived from the model_params.txt file. The sequence in this file does not match the sequence found on IMGT. I don't know if there is a reason for this, if so I'd like to know. The IMGT reference I'm talking about is found here:
https://www.imgt.org/ligmdb/view.action?id=IMGT000132

To make it easier I'll include the sequences here as well:
TRBV31*01 olga:
AGACTCCAGGCACAGAGGTAGAAGCCAGAGTGGCTGAGAAGCAGCTTCTCCGTGCTTAGGATGAATTGGTCGTCCTTCGGCCTGGAAGCTGAGAGGTTCAGTTGCACCACCGACTCTACCTGGCCAACAGTAATAGAGTAGAAGAGTTGCTGGAGGGTGCCTCCTGTGGCCTGCCAGTACCAGTAGAGGTTAGGGCTTGATTTCCCCTTTATGGTACACCCCAGAGACAGTGGGCTGCCCACAGCCTTGATCTCGGCAACTGGCCATTGATGGATAGTCTGAGC
TRBV31*01 IMGT:
GCTCAGACTATCCATCAATGGCCAGTTGCCGAGATCAAGGCTGTGGGCAGCCCACTGTCTCTGGGGTGTACCATAAAGGGGAAATCAAGCCCTAACCTCTACTGGTACTGGCAGGCCACAGGAGGCACCCTCCAGCAACTCTTCTACTCTATTACTGTTGGCCAGGTAGAGTCGGTGGTGCAACTGAACCTCTCAGCTTCCAGGCCGAAGGACGACCAATTCATCCTAAGCACGGAGAAGCTGCTTCTCAGCCACTCTGGCTTCTACCTCTGTGCCTGGAGTCT

Kind regards,
Gabe van den Hoeven
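
Comparing the ends of the two sequences quoted above suggests that the OLGA entry may be the reverse complement of the IMGT entry. The small check below is applied only to the terminal fragments, so a full-length comparison would still be needed to confirm it. `reverse_complement` is a helper written for this illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


# First 20 nt of the OLGA sequence and last 20 nt of the IMGT sequence.
olga_start = "AGACTCCAGGCACAGAGGTA"
imgt_end = "TACCTCTGTGCCTGGAGTCT"

# Last 13 nt of the OLGA sequence and first 13 nt of the IMGT sequence.
olga_end = "GGATAGTCTGAGC"
imgt_start = "GCTCAGACTATCC"

print(reverse_complement(imgt_end) == olga_start)  # True
print(reverse_complement(imgt_start) == olga_end)  # True
```

If the full sequences also match this way, the difference would be one of strand orientation rather than of sequence content.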
