Comments (6)
TCRdist uses allele-level information to calculate probabilities of generation, among other things. So when it processes a parsed_seqs_file, it looks for chains that can be found explicitly in the current version of the annotation database, which is located at /tcr-dist/db/alphabeta_db.tsv. This is one of the reasons it's generally best to parse/annotate your sequences using the TCRdist pipeline, so that you aren't introducing different segment names, etc. Alternatively, you can annotate your data using whatever other pipeline you have, but use the TCRdist VDJB database for those annotations. If you can't do that, you can try editing the database file and then using the option --no_probabilities (otherwise I would expect your probabilities to be incorrect).
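One way to see in advance which gene names would not be found is to compare them against the names listed in the database file. The following is only a rough sketch: the column name `id`, the helper names, and the in-memory example rows are assumptions for illustration, not taken from the actual file, so check the header of alphabeta_db.tsv before relying on it.

```python
import csv
import io


def load_known_ids(handle, id_column="id"):
    """Collect the allele names listed in a tab-separated database file.

    `id_column` is an assumption about the header; adjust it to match the file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column] for row in reader if row.get(id_column)}


def find_unknown_genes(gene_names, known_ids):
    """Return the gene names (sorted, de-duplicated) that are absent from known_ids."""
    return sorted(set(gene_names) - known_ids)


# Tiny in-memory stand-in for the real database file (made-up rows).
example_db = io.StringIO("id\tchain\nTRAV1-1*01\tA\nTRBV2*01\tB\n")
known = load_known_ids(example_db)

# "TRBV2" lacks an allele suffix and "TRBJ9-9*01" is invented, so both are reported.
missing = find_unknown_genes(["TRAV1-1*01", "TRBV2", "TRBJ9-9*01"], known)
print(missing)
```

For real data, the `io.StringIO` stand-in would be replaced by an open file handle on the database file.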
Having said all of that... I don't quite understand what you mean by "I currently only have V gene-level information". The example data you posted includes J gene information as well.
Feel free to email me directly if you want to discuss further how to get an unsupported data format into TCRdist. It is possible to do, but it's important to take into consideration that some of the standard analyses presented in the output are going to be biased as a result.
from tcr-dist.
Hi Gabrielle,
What has worked for me in the past is to just add "*01" to all the gene names. There may still be a few that cause trouble, but I would give that a shot for starters. We are in the process of doing some refactoring which should help with this kind of I/O trouble, so I bet we can get this figured out.
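As a minimal sketch of that workaround (the function name and the example gene names are illustrative only), the suffix can be appended to every name that does not already carry an allele:

```python
def add_default_allele(gene, default="*01"):
    """Append a default allele suffix unless the name already carries one."""
    gene = gene.strip()
    if not gene or "*" in gene:
        return gene
    return gene + default


# Example gene names; the second already has an allele and is left unchanged.
genes = ["TRBV19", "TRBJ2-7*01", "TRAV12-2"]
fixed = [add_default_allele(g) for g in genes]
print(fixed)
```

Note that this assigns allele *01 regardless of the true allele, so it is a naming fix rather than a real allele call.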
Let me know whether that fix works for you.
Take care,
Phil
Hi,
Following on from this question, can you give us an example of how to use a parsed seq file with the minimum information necessary to run? For example, with just the CDR3 sequence and the V and J genes, will this work?
Otherwise, some more information on the minimum input columns needed to run tcr-dist successfully with limited info would be helpful.
Many thanks,
Alex
Hi Alex,
Sorry that things aren't clearer! This (i.e., not starting from the beginning with complete read nucleotide sequences) is definitely one area that needs work. Just to clarify: do you have the nucleotide sequence of the CDR3 regions? Some of the scripts use that information for the graphics. But it's certainly not necessary for all of them...
Take care,
Phil
Hi Phil,
No worries, it's an ambitious job.
I had moderate success with my raw output "target" sequences from MiXCR; however, at short sequence lengths tcr-dist failed to find/classify the CDR3 (where MiXCR seems to be building the CDR3 sequence).
I'll keep playing with the CDR3 nucleotide sequences as you have suggested, then see what I can get from my raw MiXCR data.
Ultimately, as I have the V, J and CDR3, I should be able to "build" a raw sequence which I can input directly into the paired_seq_file (using fake_probabilities and fake_beta or fake_alpha).
If I have success I will put the details here.
Many thanks,
Alex
Was this refactoring introduced? I am faced with the same challenge with 10X data processed using the V(D)J T Cell Analysis with cellranger vdj pipeline in addition to mixcr software. Adding *01 to all genes presents a challenge due to the large dataset and presence of genes like TRAJ43;TRAJ34 or TRAV29/DV5 in the same clonotype. Not sure about the overall effect of editing /tcr-dist/db/alphabeta_db.tsv.
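For a large table, the renaming could be scripted rather than done by hand. The sketch below rests on several assumptions that are not confirmed anywhere in this thread: that `;` separates alternative calls, that keeping only the first alternative is acceptable (this can bias downstream results), and that `/` belongs to the gene name itself, as in TRAV29/DV5, and so must not be split. The column names `v_gene` and `j_gene` are hypothetical, and whether the resulting names match the entries in the database file still needs to be checked.

```python
import csv
import io


def normalize_gene_call(call, default_allele="*01"):
    """Reduce a possibly ambiguous call to one allele-level name.

    Assumptions: ';' separates alternative calls and the first is kept;
    '/' is left alone because names such as TRAV29/DV5 contain it.
    """
    first = call.split(";")[0].strip()
    if not first or "*" in first:
        return first
    return first + default_allele


def normalize_rows(rows, gene_columns):
    """Yield copies of each row with the listed gene columns normalized."""
    for row in rows:
        fixed = dict(row)
        for col in gene_columns:
            fixed[col] = normalize_gene_call(row[col])
        yield fixed


# Made-up two-row table; column names "v_gene" and "j_gene" are hypothetical.
table = io.StringIO(
    "v_gene\tj_gene\nTRAV29/DV5\tTRAJ43;TRAJ34\nTRBV19*01\tTRBJ2-7\n"
)
reader = csv.DictReader(table, delimiter="\t")
rows = list(normalize_rows(reader, ["v_gene", "j_gene"]))
print(rows)
```

Because the rows are processed one at a time through a generator, the same approach works on a large file without loading it into memory all at once.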