Hi, I was running CHROMEISTER on a plant genome against itself (~1 G

Hello <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-ur

Not finding any synteny in test data about chromeister HOT 7 OPEN

estebanpw commented on July 19, 2024

Not finding any synteny in test data

from chromeister.

Comments (7)

theo-allnutt-bioinformatics commented on July 19, 2024

And on repeating above, trying different file names, I get no output at all.. no hits.

CHROMEISTER -query cptest2.fasta -db cptest2copy.fasta -out cpdotplot.mat && Rscript /g/data/nm31/bin/chromeister/bin/compute_score.R cpdotplot.mat 1000

[INFO] Generating a 1000x1000 matrix
[INFO] Loading database
100%...[INFO] Database loaded and of length 374340.
[INFO] Ratios: Q [3.743400e+02] D [3.743400e+02]. Lenghts: Q [374340] D [374340]
[INFO] Pixel size: Q [2.671368e-03] D [2.671368e-03].
[INFO] Computing absolute hit numbers.
100%...Scanning hits table.
99%...
[INFO] Query length 374340.
[INFO] Writing matrix.
[INFO] Found 0 unique hits for z = 4.
0.996

from chromeister.

theo-allnutt-bioinformatics commented on July 19, 2024

Sorry, have just noticed, no issues resolved since June 2022. I guess this program is no longer supported.

from chromeister.

estebanpw commented on July 19, 2024

Hello @theo-allnutt-bioinformatics !

Unfortunately, as you mentioned, the program is no longer maintained. However I had some time to look at the problem you are reporting.

It is wierd that no synteny at all is found, but it could be possible. Your execution of the fasta cptest2.fasta shows that 0 unique hits are found and that is why the plot is empty. Could it be that cptest2.fasta was generated artificially?

The problem is that CHROMEISTER is a highly heuristic method that relies on finding small and unique words, which are called unique hits (by default a combination of a fixed 12-mer and its heuristic value). If, for example, a fasta file contained only multiple repeated substrings, then no unique hits could be found.

I did not expect this to be the case, but confirmed it by slightly mutating you cptest2.fasta and then running it against itself. If you want to try, you can use this AWK script to mutate at a rate of 1%:

awk 'BEGIN{val[1]="A"; val[2]="C"; val[3]="G"; val[4]="T"; } /^>/{printf ("%s\n", $0);} /^[^>]/{split($0, arr, ""); for(i=1; i<length(arr); i+=1) if (rand() > 0.01) printf("%c", arr[i]); else printf("%c", val[ int( rand() * 4 + 1)]); printf("\n");}' cptest2.fasta.txt > mutated.fasta

(The script will take all nucleotides and replace approx. 1 in every 100 with a different one)

The new mutated.fasta file has now unique hits since some words, even if they were highly repeated in the original file, now contain a single point mutation that is detected as a unique hit. See:

[INFO] Generating a 1000x1000 matrix
[INFO] Loading database
99%...[INFO] Database loaded and of length 374332.
[INFO] Ratios: Q [3.743320e+02] D [3.743320e+02]. Lenghts: Q [374332] D [374332]
[INFO] Pixel size: Q [2.671425e-03] D [2.671425e-03].
[INFO] Computing absolute hit numbers.
99%...Scanning hits table.
99%...
[INFO] Query length 374332.
[INFO] Writing matrix.
[INFO] Found 2046 unique hits for z = 4.
0.016

Note the 2046 unique hits. When plotting, it looks like the following:

Where only a diagonal synteny is observed since we are comparing one sequence against itself, and by definition CHROMEISTER removes all possible repeats.

If you suspect that your original plant genomes contain only repeated substrings, then you can try to use the mutation trick (although for a 1 GB file the AWK command can take ~20 min) and then rerun it through CHROMEISTER. But this sounds like a long shot to me - are you completely sure that there should be a large synteny? How many "unique hits" does CHROMEISTER report on them?

I hope this is helpful!
Esteban

from chromeister.

theo-allnutt-bioinformatics commented on July 19, 2024

Hi,

thanks for your reply. The test data consists of direct and inverted long repeats (as most chloroplast and mitochondrial genomes do), so the dot plots should show this. I understand your explanation, but unfortunately this means that your program is not usable for me. I think that it may also be a large problem for other users - it is quite frequent to find large, perfectly, or almost perfectly identical synteny in genomes - for examples in artificially generated polyploids or induced mutants. The inability to detect these syntenies is a major issue. I think that you should perhaps point this out at the top of your readme / manual, so that potentially costly mistakes are not made. For example, I had already run chromeister on several genomes and found no synteny. I will now have to repeat that work. It might also be advisable to contact the publishing journal and add an addendum to the paper.

Thanks again for your help,

Theo Allnutt

from chromeister.

theo-allnutt-bioinformatics commented on July 19, 2024

Also:

I think it is important to show unmasked repeats when required - because it is not possible to distinguish these from synteny. The best solution is to allow users to first mask the repeats they are not interested in (e.g. TEs) then look for synteny on the remaining (including repeats within contigs). If you do not make it clear that this program will only work with 'evolved' synteny on different molecules, then others may make the same mistake as I did and assume no synteny was present.

from chromeister.

estebanpw commented on July 19, 2024

Hello @theo-allnutt-bioinformatics !

Thanks for your reply. Could you suggest two large, publicly available chloroplast / mitochondrial genomes (e.g. NCBI)? I would like to try something out that could maybe improve CHROMEISTER, and if it does not, then at last I can add a warning and a specific example.

Thanks
Esteban

from chromeister.

theo-allnutt-bioinformatics commented on July 19, 2024

Hi,

I think you may have misunderstood my point. Although I used a chloroplast genome in my example, it could be any genome with a repeat. You could make an artificial molecule (or take any sequence) and directly copy it into a tandem or inverted repeat, and in its current form, CHROMEISTER would not detect it, as shown in my example.

Here is another example. The sequence file is:

MT726210.1 Brassica rapa chloroplast, complete genome
NC_028272.1 Brassica juncea chloroplast, complete genome
KT581449.1 Brassica juncea chloroplast, complete genome

brassica_chloroplast.fasta.txt

These two species are closely related, so direct synteny should be visible. And obviously the two B. juncea chloroplast genomes should be even more so. We should also be able to see the long inverted repeats, which most chloroplasts have. However, the dot plot is 'grey':

I hope this illustrates the problem.

Theo

from chromeister.

Not finding any synteny in test data about chromeister HOT 7 OPEN

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent