Giter Site home page Giter Site logo

Comments (7)

theo-allnutt-bioinformatics avatar theo-allnutt-bioinformatics commented on July 19, 2024

And on repeating above, trying different file names, I get no output at all.. no hits.

CHROMEISTER -query cptest2.fasta -db cptest2copy.fasta -out cpdotplot.mat && Rscript /g/data/nm31/bin/chromeister/bin/compute_score.R cpdotplot.mat 1000

[INFO] Generating a 1000x1000 matrix
[INFO] Loading database
100%...[INFO] Database loaded and of length 374340.
[INFO] Ratios: Q [3.743400e+02] D [3.743400e+02]. Lenghts: Q [374340] D [374340]
[INFO] Pixel size: Q [2.671368e-03] D [2.671368e-03].
[INFO] Computing absolute hit numbers.
100%...Scanning hits table.
99%...
[INFO] Query length 374340.
[INFO] Writing matrix.
[INFO] Found 0 unique hits for z = 4.
0.996

from chromeister.

theo-allnutt-bioinformatics avatar theo-allnutt-bioinformatics commented on July 19, 2024

Sorry, have just noticed, no issues resolved since June 2022. I guess this program is no longer supported.

from chromeister.

estebanpw avatar estebanpw commented on July 19, 2024

Hello @theo-allnutt-bioinformatics !

Unfortunately, as you mentioned, the program is no longer maintained. However I had some time to look at the problem you are reporting.

It is wierd that no synteny at all is found, but it could be possible. Your execution of the fasta cptest2.fasta shows that 0 unique hits are found and that is why the plot is empty. Could it be that cptest2.fasta was generated artificially?

The problem is that CHROMEISTER is a highly heuristic method that relies on finding small and unique words, which are called unique hits (by default a combination of a fixed 12-mer and its heuristic value). If, for example, a fasta file contained only multiple repeated substrings, then no unique hits could be found.

I did not expect this to be the case, but confirmed it by slightly mutating you cptest2.fasta and then running it against itself. If you want to try, you can use this AWK script to mutate at a rate of 1%:

awk 'BEGIN{val[1]="A"; val[2]="C"; val[3]="G"; val[4]="T"; } /^>/{printf ("%s\n", $0);} /^[^>]/{split($0, arr, ""); for(i=1; i<length(arr); i+=1) if (rand() > 0.01) printf("%c", arr[i]); else printf("%c", val[ int( rand() * 4 + 1)]); printf("\n");}' cptest2.fasta.txt > mutated.fasta

(The script will take all nucleotides and replace approx. 1 in every 100 with a different one)

The new mutated.fasta file has now unique hits since some words, even if they were highly repeated in the original file, now contain a single point mutation that is detected as a unique hit. See:

[INFO] Generating a 1000x1000 matrix
[INFO] Loading database
99%...[INFO] Database loaded and of length 374332.
[INFO] Ratios: Q [3.743320e+02] D [3.743320e+02]. Lenghts: Q [374332] D [374332]
[INFO] Pixel size: Q [2.671425e-03] D [2.671425e-03].
[INFO] Computing absolute hit numbers.
99%...Scanning hits table.
99%...
[INFO] Query length 374332.
[INFO] Writing matrix.
[INFO] Found 2046 unique hits for z = 4.
0.016

Note the 2046 unique hits. When plotting, it looks like the following:

imagen

Where only a diagonal synteny is observed since we are comparing one sequence against itself, and by definition CHROMEISTER removes all possible repeats.

If you suspect that your original plant genomes contain only repeated substrings, then you can try to use the mutation trick (although for a 1 GB file the AWK command can take ~20 min) and then rerun it through CHROMEISTER. But this sounds like a long shot to me - are you completely sure that there should be a large synteny? How many "unique hits" does CHROMEISTER report on them?

I hope this is helpful!
Esteban

from chromeister.

theo-allnutt-bioinformatics avatar theo-allnutt-bioinformatics commented on July 19, 2024

Hi,

thanks for your reply. The test data consists of direct and inverted long repeats (as most chloroplast and mitochondrial genomes do), so the dot plots should show this. I understand your explanation, but unfortunately this means that your program is not usable for me. I think that it may also be a large problem for other users - it is quite frequent to find large, perfectly, or almost perfectly identical synteny in genomes - for examples in artificially generated polyploids or induced mutants. The inability to detect these syntenies is a major issue. I think that you should perhaps point this out at the top of your readme / manual, so that potentially costly mistakes are not made. For example, I had already run chromeister on several genomes and found no synteny. I will now have to repeat that work. It might also be advisable to contact the publishing journal and add an addendum to the paper.

Thanks again for your help,

Theo Allnutt

from chromeister.

theo-allnutt-bioinformatics avatar theo-allnutt-bioinformatics commented on July 19, 2024

Also:

I think it is important to show unmasked repeats when required - because it is not possible to distinguish these from synteny. The best solution is to allow users to first mask the repeats they are not interested in (e.g. TEs) then look for synteny on the remaining (including repeats within contigs). If you do not make it clear that this program will only work with 'evolved' synteny on different molecules, then others may make the same mistake as I did and assume no synteny was present.

from chromeister.

estebanpw avatar estebanpw commented on July 19, 2024

Hello @theo-allnutt-bioinformatics !

Thanks for your reply. Could you suggest two large, publicly available chloroplast / mitochondrial genomes (e.g. NCBI)? I would like to try something out that could maybe improve CHROMEISTER, and if it does not, then at last I can add a warning and a specific example.

Thanks
Esteban

from chromeister.

theo-allnutt-bioinformatics avatar theo-allnutt-bioinformatics commented on July 19, 2024

Hi,

I think you may have misunderstood my point. Although I used a chloroplast genome in my example, it could be any genome with a repeat. You could make an artificial molecule (or take any sequence) and directly copy it into a tandem or inverted repeat, and in its current form, CHROMEISTER would not detect it, as shown in my example.

Here is another example. The sequence file is:

MT726210.1 Brassica rapa chloroplast, complete genome
NC_028272.1 Brassica juncea chloroplast, complete genome
KT581449.1 Brassica juncea chloroplast, complete genome

brassica_chloroplast.fasta.txt

These two species are closely related, so direct synteny should be visible. And obviously the two B. juncea chloroplast genomes should be even more so. We should also be able to see the long inverted repeats, which most chloroplasts have. However, the dot plot is 'grey':

cpdotplot mat filt

I hope this illustrates the problem.

Theo

from chromeister.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.