Comments (7)
And on repeating above, trying different file names, I get no output at all.. no hits.
CHROMEISTER -query cptest2.fasta -db cptest2copy.fasta -out cpdotplot.mat && Rscript /g/data/nm31/bin/chromeister/bin/compute_score.R cpdotplot.mat 1000
[INFO] Generating a 1000x1000 matrix
[INFO] Loading database
100%...[INFO] Database loaded and of length 374340.
[INFO] Ratios: Q [3.743400e+02] D [3.743400e+02]. Lenghts: Q [374340] D [374340]
[INFO] Pixel size: Q [2.671368e-03] D [2.671368e-03].
[INFO] Computing absolute hit numbers.
100%...Scanning hits table.
99%...
[INFO] Query length 374340.
[INFO] Writing matrix.
[INFO] Found 0 unique hits for z = 4.
0.996
from chromeister.
Sorry, have just noticed, no issues resolved since June 2022. I guess this program is no longer supported.
from chromeister.
Hello @theo-allnutt-bioinformatics !
Unfortunately, as you mentioned, the program is no longer maintained. However I had some time to look at the problem you are reporting.
It is wierd that no synteny at all is found, but it could be possible. Your execution of the fasta cptest2.fasta
shows that 0 unique hits are found and that is why the plot is empty. Could it be that cptest2.fasta
was generated artificially?
The problem is that CHROMEISTER is a highly heuristic method that relies on finding small and unique words, which are called unique hits (by default a combination of a fixed 12-mer and its heuristic value). If, for example, a fasta file contained only multiple repeated substrings, then no unique hits could be found.
I did not expect this to be the case, but confirmed it by slightly mutating you cptest2.fasta
and then running it against itself. If you want to try, you can use this AWK script to mutate at a rate of 1%:
awk 'BEGIN{val[1]="A"; val[2]="C"; val[3]="G"; val[4]="T"; } /^>/{printf ("%s\n", $0);} /^[^>]/{split($0, arr, ""); for(i=1; i<length(arr); i+=1) if (rand() > 0.01) printf("%c", arr[i]); else printf("%c", val[ int( rand() * 4 + 1)]); printf("\n");}' cptest2.fasta.txt > mutated.fasta
(The script will take all nucleotides and replace approx. 1 in every 100 with a different one)
The new mutated.fasta
file has now unique hits since some words, even if they were highly repeated in the original file, now contain a single point mutation that is detected as a unique hit. See:
[INFO] Generating a 1000x1000 matrix
[INFO] Loading database
99%...[INFO] Database loaded and of length 374332.
[INFO] Ratios: Q [3.743320e+02] D [3.743320e+02]. Lenghts: Q [374332] D [374332]
[INFO] Pixel size: Q [2.671425e-03] D [2.671425e-03].
[INFO] Computing absolute hit numbers.
99%...Scanning hits table.
99%...
[INFO] Query length 374332.
[INFO] Writing matrix.
[INFO] Found 2046 unique hits for z = 4.
0.016
Note the 2046
unique hits. When plotting, it looks like the following:
Where only a diagonal synteny is observed since we are comparing one sequence against itself, and by definition CHROMEISTER removes all possible repeats.
If you suspect that your original plant genomes contain only repeated substrings, then you can try to use the mutation trick (although for a 1 GB file the AWK command can take ~20 min) and then rerun it through CHROMEISTER. But this sounds like a long shot to me - are you completely sure that there should be a large synteny? How many "unique hits" does CHROMEISTER report on them?
I hope this is helpful!
Esteban
from chromeister.
Hi,
thanks for your reply. The test data consists of direct and inverted long repeats (as most chloroplast and mitochondrial genomes do), so the dot plots should show this. I understand your explanation, but unfortunately this means that your program is not usable for me. I think that it may also be a large problem for other users - it is quite frequent to find large, perfectly, or almost perfectly identical synteny in genomes - for examples in artificially generated polyploids or induced mutants. The inability to detect these syntenies is a major issue. I think that you should perhaps point this out at the top of your readme / manual, so that potentially costly mistakes are not made. For example, I had already run chromeister on several genomes and found no synteny. I will now have to repeat that work. It might also be advisable to contact the publishing journal and add an addendum to the paper.
Thanks again for your help,
Theo Allnutt
from chromeister.
Also:
I think it is important to show unmasked repeats when required - because it is not possible to distinguish these from synteny. The best solution is to allow users to first mask the repeats they are not interested in (e.g. TEs) then look for synteny on the remaining (including repeats within contigs). If you do not make it clear that this program will only work with 'evolved' synteny on different molecules, then others may make the same mistake as I did and assume no synteny was present.
from chromeister.
Hello @theo-allnutt-bioinformatics !
Thanks for your reply. Could you suggest two large, publicly available chloroplast / mitochondrial genomes (e.g. NCBI)? I would like to try something out that could maybe improve CHROMEISTER, and if it does not, then at last I can add a warning and a specific example.
Thanks
Esteban
from chromeister.
Hi,
I think you may have misunderstood my point. Although I used a chloroplast genome in my example, it could be any genome with a repeat. You could make an artificial molecule (or take any sequence) and directly copy it into a tandem or inverted repeat, and in its current form, CHROMEISTER would not detect it, as shown in my example.
Here is another example. The sequence file is:
MT726210.1 Brassica rapa chloroplast, complete genome
NC_028272.1 Brassica juncea chloroplast, complete genome
KT581449.1 Brassica juncea chloroplast, complete genome
brassica_chloroplast.fasta.txt
These two species are closely related, so direct synteny should be visible. And obviously the two B. juncea chloroplast genomes should be even more so. We should also be able to see the long inverted repeats, which most chloroplasts have. However, the dot plot is 'grey':
I hope this illustrates the problem.
Theo
from chromeister.
Related Issues (20)
- remove the [1] and null device from the stdout of both R scripts
- Comparison of large genomes.. messy plot HOT 9
- Sorting the plot from chromeister HOT 3
- compute_score.r error HOT 2
- get the coordinate of synteny block HOT 3
- parallel analysis HOT 1
- Interpretation of scores HOT 1
- coordinates from the events file HOT 4
- Get plots in pdf or svg formats? HOT 4
- Breakpoint about inversion HOT 2
- Problem with run_and_plot_chromeister.sh
- error cannot open file 'dotplot.mat.csv': No such file or directory HOT 11
- Bad Install due opencv-python HOT 1
- error of removal of "index-refseq-qryseq.csv" HOT 1
- Whole genome comparisons. HOT 1
- bring chromeister to bioconda and update galaxy tool HOT 20
- Error in axis - no locations are finite HOT 2
- Look into the dotplot matrix format in galaxy
- Remove empty lines before she bang lines
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from chromeister.