phbradley / tcr-dist
Software tools for the analysis of epitope-specific T cell receptor (TCR) repertoires (scroll down for the README)
License: MIT License
I think we currently exclude all out-of-frame CDR3s from every part of the analysis after sequence parsing. Excluding them from the analysis makes sense since they are non-functional, but it is important to track their clonality, as it is a correlate of the clonality signals for functional CDR3s.
When this Python script invokes file_converter.py, I think it passes the wrong flag: it uses --check_genes, but the flag in file_converter.py is dont_check_genes.
Dear Phil,
After reading your paper, I am interested in identifying shared motifs in a particular subset of sorted T cells that undergo strong TCR signaling during their development in the thymus. I'm quite new to Python, but I have been able to set up the environment and install the packages and all dependencies on Windows using WSL.
Unfortunately I have only single chain files (alpha and beta) from several subjects. Is there a way I can adapt the input to perform the analysis only with single chain files?
Thanks in advance for your help :)
Guillem Sanchez
Can tcrdist take cdr3 amino acid sequences as inputs or does it only take the complete nucleotide sequence covering cdr1,2, and 3?
Hi there,
When I set up the gammadelta database, I found that line 66 still points to your own path. It might be better to provide the file on GitHub.
Best,
Yan
Hi there,
Great work on the package!
I've been using tcrdist3 to compute pairwise distances between my TCR sequences. I want to show that one group of TCR sequences is more clustered (closer together, more similar) than another group. How best could I go about testing this? Should I start from the network and calculate, say, the number of nodes for each group, or from the distance matrix and use PERMANOVA or something similar?
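One way to approach the distance-matrix route is a label permutation test on the mean within-group distance. The sketch below is only an illustration under stated assumptions: `dist` is a square, symmetric matrix (list of lists) of pairwise TCR distances, the two groups are given as lists of row indices with at least two members each, and the function names are made up for this example rather than taken from tcrdist3. It is not PERMANOVA itself, and like any label permutation test it assumes the sequences are exchangeable under the null, so a shift in location between groups can also influence the result.

```python
import random
from itertools import combinations


def mean_within_distance(dist, members):
    """Average pairwise distance among the given row/column indices."""
    pairs = list(combinations(members, 2))
    return sum(dist[i][j] for i, j in pairs) / float(len(pairs))


def permutation_test(dist, group_a, group_b, n_perm=2000, seed=0):
    """One-sided test of whether group_a is tighter than group_b.

    Returns (observed difference, p-value), where the difference is
    mean_within(group_a) - mean_within(group_b). Negative values mean
    group_a is more tightly clustered.
    """
    rng = random.Random(seed)
    observed = (mean_within_distance(dist, group_a)
                - mean_within_distance(dist, group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        perm_a = pooled[:len(group_a)]
        perm_b = pooled[len(group_a):]
        stat = (mean_within_distance(dist, perm_a)
                - mean_within_distance(dist, perm_b))
        if stat <= observed:
            hits += 1
    # add-one correction so the p-value is never exactly zero
    return observed, (hits + 1.0) / (n_perm + 1)
```

A call such as `permutation_test(dist, idx_group1, idx_group2)` returns the observed difference and an empirical p-value; the index lists here are placeholders for however the two groups are encoded in your data.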
Universal newline support was removed when the JCC cleaning code was removed. Need to add universal newline support back.
I've noticed that in some situations the align_cdr3s function in tcr_distances.py will fail to find the best_gappos, causing an error. However, re-running the same dataset, I was able to get it to succeed about 20% of the time. I was unable to find any consistency in the center_cdr3 and member_cdr3 pairing that caused this to fail. Examples include:
DVGYKL DPAGNTGKL
GEGSNNRI GYNTNTGKL
GDRYAQGL GDVDYAQGL
But in each of these cases, rerunning the code eventually got through them without issue. I assume the stochasticity is introduced by the random seed. Any thoughts on how best to address this, @phbradley?
Traceback (most recent call last):
File "/mnt/Data/TCR_Git/public_pipeline/tcr-dist/make_tall_trees.py", line 683, in
a,b = align_cdr3s( center_cdr3, member_cdr3, gap_character )
File "/mnt/Data/TCR_Git/public_pipeline/tcr-dist/tcr_distances.py", line 98, in align_cdr3s
s0 = s0[:best_gappos+1] + gap_character*lendiff + s0[best_gappos+1:]
UnboundLocalError: local variable 'best_gappos' referenced before assignment
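A general guard against this kind of UnboundLocalError is to bind the best position before the search loop, so the final slicing step always has a value to work with. The sketch below is a hypothetical stand-in, not the code in tcr_distances.py: the name `align_with_single_gap` is invented here, and the score is a plain count of identical positions, assumed only for illustration in place of whatever scoring align_cdr3s really uses.

```python
def align_with_single_gap(a, b, gap_character="-"):
    """Pad the shorter of two strings with one contiguous block of gaps.

    Hypothetical illustration only: the score is the number of identical
    aligned positions, which is an assumption made for this example.
    """
    if len(a) == len(b):
        return a, b
    swapped = len(a) > len(b)
    s0, s1 = (b, a) if swapped else (a, b)  # s0 is the shorter string
    lendiff = len(s1) - len(s0)

    best_score = float("-inf")
    best_gappos = 0  # bound before the loop, so it can never be undefined
    for gappos in range(len(s0) + 1):  # gap block goes before s0[gappos]
        candidate = s0[:gappos] + gap_character * lendiff + s0[gappos:]
        score = sum(x == y for x, y in zip(candidate, s1))
        if score > best_score:
            best_score, best_gappos = score, gappos

    s0 = s0[:best_gappos] + gap_character * lendiff + s0[best_gappos:]
    return (s1, s0) if swapped else (s0, s1)
```

Starting `best_score` at negative infinity means the first candidate always wins the comparison, so the outcome no longer depends on any score exceeding a fixed initial threshold.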
It is currently possible for sequences that are missing required parts of certain gene segments to evade the parts of our pipeline that filter out of frame TCRs. For instance: AAGGCCCTGCCCAGCTAATCTTAATACGTTCAAATGAGCGAGAGAAGCGCAGTGGAAGACTCAGAGCCACCCTTGACACTTCCAGCCAGAGCAGCTCCCTGTCCATCACTGGTACTCTAGCTACAGACACTGCTGTGTACTTCTGTGCTACTGATAAGGCTGGAGGACTAAGTGACATCCAGAACCCAGAACCTGCTGTGTACTGACACCCCAGATCGGAAGAGCGTCGTGTAGGGAAAGAG produces what the pipeline considers a valid TCR, but it is clear that a portion of the J segment is missing. I do not currently believe that this is a widespread issue when the quality of the data is good; however, certain bulk-sequencing approaches that some use to prepare data for TCRdist use assembly algorithms in the process, and this assembly can introduce errors that TCRdist should be able to identify as problematic during parsing. (This particular sequence was generated by MiXCR.)
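One possible extra check during parsing is to require that a reading frame contains a conserved cysteine followed, with no stop codon in between, by the [FW]-G-X-G motif that J segments normally carry. The sketch below is only an illustration of that idea and is not part of the pipeline: the function names, the regular expression, and the 4 to 40 residue spacing are all assumptions chosen for this example, and it scans the forward strand only.

```python
import re

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumed pattern: Cys, then 4-40 non-stop residues, then [FW]-G-X-G.
J_MOTIF = re.compile(r"C[^*]{4,40}?[FW]G[^*]G")


def translate(nucseq, frame=0):
    """Translate a nucleotide string in the given frame; unknown codons become X."""
    nucseq = nucseq.upper()
    codons = (nucseq[i:i + 3] for i in range(frame, len(nucseq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def has_cdr3_with_j_motif(nucseq):
    """True if any forward frame shows Cys ... [FW]GXG without a stop codon."""
    return any(J_MOTIF.search(translate(nucseq, frame)) for frame in range(3))
```

A read that fails this test is not necessarily non-functional, but it could be flagged for closer inspection before it is accepted as a valid TCR.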
Hello,
I am trying to run TCR-dist on a data set of parsed TCR alpha chains.
An abridged version of my data set is here in .txt format:
clones_file.txt
The code I used to run the basic analysis script is as follows:
python /Users/cajames2/tcr-dist/run_basic_analysis.py --organism human --parsed_seqs_file /Users/cajames2/TCRSeq/clones_file.tsv --make_fake_beta --make_fake_quals
The script then runs all the way through, but returns blank tables and plots. I ran the test data set test_small_human_pairseqs_v1_parsed_seqs.tsv and saw outputs. I also deleted the beta chain columns and quality scores from it and ran only the alpha chain information with --make_fake_beta and --make_fake_quals, and it worked just fine.
I think I have traced the issue to something that the parse_tsv_file function is dependent on. I modified the parse_tsv.py script to print the all_clones file so that I could see whether my data was being read correctly and this file is blank after I run the run_basic_analysis.py script. However, when I run the parse_tsv.py script on my data independently, it reads my data and generates a populated all_clones file.
Do you have any insight into why the parse_tsv_file function won't read my data when run in the context of the run_basic_analysis.py script, but works just fine when run independently?
Hi, dear team,
Using the tcr-dist plug-in, I found a surprising result that I really care about, especially the output of make_tall_trees.py. However, make_tall_trees.py only outputs a figure, as shown below:
In this figure, the seq-logo sub-cluster result on the left is exactly what I want. Can this result be output as text, cluster by cluster? Looking forward to your reply, thanks.
Hello,
I currently only have gene-level information (no alleles), and the first few rows of my "parsed_seqs_file" look like the table below. My error comes from the "all_genes.py" script not being able to find the gene in the file "alphabeta_db.tsv", because my gene names don't have the allele information (example: TRAV26-1 not found, when it expects something like TRAV26-1*01). Is there a way to run tcr-dist without this information? Thank you so much for any advice.
id | epitope | subject | va_gene | ja_gene | vb_gene | jb_gene | cdr3a | cdr3a_nucseq | cdr3b | cdr3b_nucseq | va_reps | ja_reps | vb_reps | jb_reps | va_countreps | ja_countreps | vb_countreps | jb_countreps | cdr3a_quals | cdr3b_quals | va_genes | vb_genes | ja_genes | jb_genes | va_rep | ja_rep | vb_rep | jb_rep |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
345 | Nef | 5 | TRAV26-1 | TRAJ43 | TRBV13 | TRBJ1-5 | CIVRAPGRADMRF | TGCATTGTGCGCGCGCCGGGCCGCGCGGATATGCGCTTT | CASSYLPGQGDHYSNQPQHF | TGCGCGAGCAGCTATCTGCCGGGCCAGGGCGATCATTATAGCAACCAGCCGCAGCATTTT | TRAV26-1 | TRAJ43 | TRBV13 | TRBJ1-5 | TRAV26-1 | TRAJ43 | TRBV13 | TRBJ1-5 | 99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99 | 99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99 | TRAV26-1 | TRBV13 | TRAJ43 | TRBJ1-5 | TRAV26-1 | TRAJ43 | TRBV13 | TRBJ1-5 |
3871 | p65 | 1 | TRAV8-6 | TRAJ30 | TRBV28 | TRBJ2-7 | CAVSDKNRDDKIIF | TGCGCGGTGAGCGATAAAAACCGCGATGATAAAATTATTTTT | CASRPGTASYEQYF | TGCGCGAGCCGCCCGGGCACCGCGAGCTATGAACAGTATTTT | TRAV8-6 | TRAJ30 | TRBV28 | TRBJ2-7 | TRAV8-6 | TRAJ30 | TRBV28 | TRBJ2-7 | 99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99 | 99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99 | TRAV8-6 | TRBV28 | TRAJ30 | TRBJ2-7 | TRAV8-6 | TRAJ30 | TRBV28 | TRBJ2-7 |
3740 | p65 | 104 | TRAV3 | TRAJ12 | TRBV7-9 | TRBJ2-7 | CATVSRMDSSYKLIF | TGCGCGACCGTGAGCCGCATGGATAGCAGCTATAAACTGATTTTT | CASSLIGEGTGWHQYF | TGCGCGAGCAGCCTGATTGGCGAAGGCACCGGCTGGCATCAGTATTTT | TRAV3 | TRAJ12 | TRBV7-9 | TRBJ2-7 | TRAV3 | TRAJ12 | TRBV7-9 | TRBJ2-7 | 99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99 | 99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99 | TRAV3 | TRBV7-9 | TRAJ12 | TRBJ2-7 | TRAV3 | TRAJ12 | TRBV7-9 | TRBJ2-7 |
3742 | p65 | 106 | TRAV16 | TRAJ26 | TRBV3-1 | TRBJ1-1 | CADYYGQNFVF | TGCGCGGATTATTATGGCCAGAACTTTGTGTTT | CASSFQGYTEAFF | TGCGCGAGCAGCTTTCAGGGCTATACCGAAGCGTTTTTT | TRAV16 | TRAJ26 | TRBV3-1 | TRBJ1-1 | TRAV16 | TRAJ26 | TRBV3-1 | TRBJ1-1 | 99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99 | 99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99 | TRAV16 | TRBV3-1 | TRAJ26 | TRBJ1-1 | TRAV16 | TRAJ26 | TRBV3-1 | TRBJ1-1 |
My script run is this:
python2 run_basic_analysis.py --organism human --parsed_seqs_file parsed_seqs_file --make_fake_quals --no_probabilities > parsed_seqs_file.out
-Gabrielle
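One workaround that could be tried is to append a default allele to every gene name that lacks one before running the pipeline. The sketch below is only an illustration and not a tcr-dist feature: the helper names are invented, the choice of "*01" is arbitrary, and it assumes that allele exists in alphabeta_db.tsv for every gene involved, which would need to be checked. The column list is also an assumption; other columns holding gene names (for example the *_reps and *_genes columns) may need the same treatment.

```python
import csv
import io

# Assumed set of columns to patch; extend as needed for your file.
GENE_COLUMNS = ("va_gene", "ja_gene", "vb_gene", "jb_gene")


def add_default_allele(gene, allele="*01"):
    """Append a default allele unless the name is empty or already has one."""
    gene = gene.strip()
    if not gene or "*" in gene:
        return gene
    return gene + allele


def add_alleles_to_tsv(text, columns=GENE_COLUMNS):
    """Return tab-separated text with default alleles added in the given columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in columns:
            if row.get(col) is not None:
                row[col] = add_default_allele(row[col])
        writer.writerow(row)
    return out.getvalue()
```

Because the allele is guessed rather than observed, any downstream result that depends on allele-level differences should be read with that in mind.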
Hello,
Thanks for developing such great tools!
I am wondering if there is a way to apply tcr-dist to CDR3 amino acid sequences.
Currently, the instructions say that "nucleotide sequences of the TCR alpha and beta chain reads" are required. However, we have a dataset with amino acid sequences instead of nucleotide sequences and would like to try your methods.
Thanks so much for your help in advance!
Best regards,
Ruoxing
Good evening,
After reading your paper, I would be interested in estimating TCRdiv for a series of epitope-specific repertoires. However, I cannot find a script dedicated to it that does not require running the full analysis. Could you please help me with this? Does one also need the alpha chain to estimate TCRdiv?
Many thanks, best regards,
Barbara
"The Windows API imposes a maximum filename length such that a filename, including the file path to get to the file, can't exceed between 255-260 characters." Filenames that violate this limit can cause problems for users on Windows machines. Some of the path/filename combinations generated automatically by the pipeline get quite long. We may want to shorten names where possible and/or include information on placing results in a base filepath for viewing when they encounter problems.
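One way names could be shortened is to truncate the filename stem and append a short hash of the full stem so that shortened names stay distinct. The sketch below is a hypothetical helper, not existing pipeline code: the function name is invented, and the default limit of 259 characters assumes the classic 260-character MAX_PATH value including its terminating character.

```python
import hashlib
import os

MAX_WINDOWS_PATH = 260  # assumed classic limit, including the terminator


def shorten_path(path, limit=MAX_WINDOWS_PATH - 1):
    """Return path unchanged if short enough, else truncate the filename stem.

    The directory and extension are kept; the stem is cut and an 8-character
    hash of the original stem is appended to keep names distinguishable.
    """
    if len(path) <= limit:
        return path
    directory, filename = os.path.split(path)
    stem, ext = os.path.splitext(filename)
    digest = hashlib.md5(stem.encode("utf-8")).hexdigest()[:8]
    # characters left for the stem after directory, separator, "_", hash, extension
    room = limit - len(directory) - 1 - 1 - len(digest) - len(ext)
    if room < 1:
        raise ValueError("directory is too long to shorten: %s" % directory)
    return os.path.join(directory, stem[:room] + "_" + digest + ext)
```

This only helps when the filename is the long part; if the base directory itself is near the limit, moving the results to a shorter base path is the remaining option.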
Hello,
I am running TCRdist, and I really like it! Thank you for taking the time to make it.
However, I am noticing that I am losing a considerable number of my TCRs. I have ~2600 clones with complete VDJ and CDR3 information, but I end up with a distance matrix of only around 200 clones.
What type of filtering is occurring that makes me lose so many clones when running TCRdist? These cells were sorted as tetramer-positive, and each clone has at least two cells. I also did my own quality control to ensure that they have all the information needed. What is causing me to lose so many clones?
Thanks!
Dear Dr. Philip Bradley,
Your work (TCRdist) published in Nature is so interesting that I want to apply it to our TCR data analysis.
As your GitHub page describes, the input to TCR-dist has three modes: pair_seq_file, parsed_seq_file and clones_file. Because of our sequencing method, it is hard for me to pair the alpha and beta chain reads from one TCR.
I would like to try the other two input modes, because we have already used MiXCR to produce some results (MiXCR output).
Could you please tell me how to prepare files for the other two input modes?
Best,
Baifeng
I am using Python 2.7 in a conda environment to run TCRdist on my Mac (macOS Catalina 10.15.4, 64-bit). I am able to run "python setup.py", but when I run "python run_basic_analysis.py" I get the message below:
****/bioinformatics_tools/tcr-dist-master/external/blast-2.2.16/bin/formatdb -p F -i ***/bioinformatics_tools/tcr-dist-master/db/alphabeta_db.tsv_files/blast_dbs/nucseq_human_A_V.fasta
sh: ***/bioinformatics_tools/tcr-dist-master/external/blast-2.2.16/bin/formatdb: Bad CPU type in executable
blast db creation failed!
Also, I couldn't install sklearn, as it seems to be available only for Python 3 and above.
Can you please help fix this problem?
The use of --no_probabilities is currently limited by the fact that compute_probs imports tcr_rearrangement_new.py or tcr_rearrangement.py (both directly and through the import of tcr_sampler.py). The rearrangement scripts have functions that require hard-coded probabilities files to exist. Currently, --no_probabilities will successfully assign a probability of 1 in all cases, but the pipeline will fail whenever the probabilities files do not exist.
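One way to remove this dependency would be to defer the import of the rearrangement code until a probability is actually requested. The sketch below shows the general pattern only: `get_probability_function` and `constant_probability` are hypothetical names, and the module and function names are passed in as parameters because the real entry points in the rearrangement scripts are not specified here.

```python
import importlib


def constant_probability(*args, **kwargs):
    """Stand-in used with --no_probabilities: every input gets probability 1."""
    return 1.0


def get_probability_function(no_probabilities, module_name, function_name):
    """Return a probability function, importing the heavy module only if needed.

    When no_probabilities is True, the named module is never imported, so
    missing probabilities files cannot cause a failure at import time.
    """
    if no_probabilities:
        return constant_probability
    module = importlib.import_module(module_name)  # deferred import
    return getattr(module, function_name)
```

With this arrangement, the hard-coded files would only be read on runs that actually request probabilities.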