estebanpw / chromeister

A dotplot generator for large chromosomes

License: GNU General Public License v3.0

R 32.79% C 49.06% Makefile 0.26% Shell 13.49% Python 4.39%
dotplot chromosomes genome genome-comparisons genomics sequence-alignment plants fast lightweight

chromeister's Introduction

install with bioconda European Galaxy server

chromeister

An ultra fast, heuristic approach to detect conserved signals in extremely large pairwise genome comparisons.

Requirements

A GCC compiler (any reasonably recent version should do), the R programming language (base installation) and python3 (tested on 3.5 and 3.8). On Linux, please make sure that CC resolves to GCC; otherwise the build might not work.

Simply download the .zip and unzip it, or clone the repository. Then issue the following command:

cd chromeister && make all -C src/ && python3 -m venv chromeisterenv && source chromeisterenv/bin/activate && pip install -r src/requirements.txt

This will compile CHROMEISTER and create a virtualenv where the python libraries will be installed (see src/requirements.txt)

If the installation finished without errors, you are ready to go! If you encounter the error ImportError: No module named 'skbuild', you might need to run source chromeisterenv/bin/activate && pip install --upgrade pip and then run pip install -r src/requirements.txt again to finish the installation.

NOTE: python and its libraries are only used for the detection of events. If the binaries compile (i.e. the make all) then you can still run CHROMEISTER and plot the results, even if the python installation did not work.

Use

There are several ways in which CHROMEISTER can be used. The simplest one is to run a 1-vs-1 comparison and then compute the score and the plot. To do so, use the binaries in the bin folder:

Simple execution

You can run CHROMEISTER directly by issuing:

CHROMEISTER -query seqX -db seqY -out dotplot.mat && Rscript compute_score.R dotplot.mat 1000

If you do not want a grid on the output dotplot (which is recommended when running comparisons with a lot of scaffolds for instance) then run the same command but replace compute_score by compute_score-nogrid, see below:

CHROMEISTER -query seqX -db seqY -out dotplot.mat && Rscript compute_score-nogrid.R dotplot.mat 1000

The 1000 value is the default size of dotplot.mat, i.e. the resolution of the matrix. If you want to change this, for example to generate a larger image (using 2000 will generate a 2000x2000 plot, so be careful), also include the parameter -dimension in CHROMEISTER. Example command with larger resolution:

CHROMEISTER -query seqX -db seqY -out dotplot.mat -dimension 2000 && Rscript compute_score.R dotplot.mat 2000
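To get a feel for what a given -dimension means, the following sketch estimates how many bases fall into each matrix cell. It assumes that the cell size is simply the sequence length divided by the dimension, which is what the "Ratios" and "Pixel size" values in the example log of the Help section suggest. The function name cell_size is hypothetical and not part of CHROMEISTER.

```python
def cell_size(sequence_length, dimension=1000):
    """Return (bases per matrix cell, cells per base).

    Assumption: these correspond to the 'Ratios' and 'Pixel size'
    values printed by CHROMEISTER; this is inferred from the log only.
    """
    if sequence_length <= 0 or dimension <= 0:
        raise ValueError("sequence_length and dimension must be positive")
    bases_per_cell = sequence_length / dimension
    pixel_size = dimension / sequence_length
    return bases_per_cell, pixel_size


# Query length taken from the example log in the Help section.
for dim in (1000, 2000):
    bases, pixel = cell_size(66588797, dim)
    print(f"dimension={dim}: ~{bases:.0f} bases per cell, pixel size {pixel:.6e}")
```

Doubling the dimension halves the number of bases per cell but quadruples the number of cells in the matrix, which is why larger values should be used with care.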

And if you want to run the events detection, use the following command (make sure that your virtualenv chromeisterenv is in the chromeister root folder):

source ../chromeisterenv/bin/activate && python3 bin/detect_events.py dotplot.mat.raw.txt

This will generate a dotplot.mat.events.txt file containing the detected and classified events. If you want a plot of the signal overlaid with the detected events, issue the same command but add the parameter png at the end (separated by a space).

You can also use the script that is in the bin folder (which will do all of the above for you):

run_and_plot_chromeister.sh (input sequence A) (input sequence B) (KMER size) (DIMENSION of plot) (inexactitude level) [optional: grid]

(see the parameters at the end; the grid keyword can be included or omitted depending on whether you want a grid in the output dotplot)

This will generate the following items:

  • Comparison matrix, i.e. a scaled matrix containing the unique and inexact hits
  • Plot of the comparison with the automatic scoring distance and grid separating different sequences (chromosomes for instance)
  • CSV file containing the coordinates of each sequence/chromosome contained within the query and the reference
  • Events file. A text file where each row is a synteny block. Note: these events are Large-Scale Genome Rearrangements heuristically determined and classified as {Synteny block, transposition, inversion, ...}, but this is an informative labelling that only considers coordinates. Do not blindly trust the classification; rather, do your own labelling based on the events.
  • Guides to be used in an exhaustive GECKO comparison (reduces runtime)

All vs All execution

You can run massive all versus all comparisons in two different ways:

  • Comparing all the sequences in one folder. This accounts for 1/2 * n * (n+1) comparisons, hence it will not compare sequence B to sequence A if the comparison for sequence A to sequence B already existed.

    • To run this mode, use the script in the bin folder: allVsAll.sh <sequences folder> <extension (e.g. fasta)> <matrix size (1000 for chromosomes, 2000 for full genomes)> <kmer size 32 (32 is best)> <inexactitude level (4 is recommended)>
  • Comparing two folders containing sequences. This accounts for n * m comparisons, therefore it will compare ALL to ALL. Use this for instance to compare all chromosomes of one genome to all chromosomes of another genome.

    • To run this mode, use the script in the bin folder: allVsAll_incremental.sh <sequences folder 1> <sequences folder 2> <extension (e.g. fasta)> <matrix size (1000 for chromosomes, 2000 for full genomes)> <kmer size 32 (32 is best)> <inexactitude level (4 is recommended)>
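The two comparison counts above can be illustrated with a short sketch. It assumes that the one-folder mode includes each sequence against itself, which is what the 1/2 * n * (n+1) formula implies; the actual scripts may behave differently. The function names are hypothetical.

```python
from itertools import combinations_with_replacement, product


def one_folder_pairs(sequences):
    """Unordered pairs with self-comparisons: n * (n + 1) / 2 in total.

    (A, B) is listed once, so B vs A is never repeated.
    """
    return list(combinations_with_replacement(sequences, 2))


def two_folder_pairs(folder_1, folder_2):
    """Every sequence of folder 1 against every sequence of folder 2: n * m."""
    return list(product(folder_1, folder_2))


chromosomes = ["chr1", "chr2", "chr3"]
print(len(one_folder_pairs(chromosomes)))                 # n * (n + 1) / 2
print(len(two_folder_pairs(chromosomes, ["A", "B"])))     # n * m
```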

At the end of both comparisons, an index will be created summarizing the score of each comparison. This index has the following format (see header and example below):

header: <SpX, SpY, IDX, IDY, IMG, CHNumberX, CHNumberY, Score, LengthX, LengthY>

example: BRAOL.Chr.C1,BRAOL.Chr.C2,>C1 dna:chromosome chromosome:v2.1:C1:1:43764888:1 REF,>C2 dna:chromosome chromosome:v2.1:C2:1:52886895:1 REF,BRAOL.Chr.C1.fasta-BRAOL.Chr.C2.fasta.mat.filt.png,C1,C2, 0.996,43764888,52886895
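As an illustration of the index format, the sketch below reads index rows into dictionaries keyed by the header fields. It assumes that fields are comma-separated, that sequence IDs contain no commas, and that a header row (if present at all) starts with SpX. The function name read_index is hypothetical.

```python
import csv
import io

# Field names taken from the header described above.
INDEX_FIELDS = ["SpX", "SpY", "IDX", "IDY", "IMG",
                "CHNumberX", "CHNumberY", "Score", "LengthX", "LengthY"]


def read_index(handle):
    """Parse index rows; skips a header row if one is present (assumption)."""
    rows = []
    reader = csv.DictReader(handle, fieldnames=INDEX_FIELDS, skipinitialspace=True)
    for row in reader:
        if row["SpX"].lstrip("<").strip() == "SpX":
            continue  # header line
        row["Score"] = float(row["Score"])
        row["LengthX"] = int(row["LengthX"])
        row["LengthY"] = int(row["LengthY"])
        rows.append(row)
    return rows


example = (
    "BRAOL.Chr.C1,BRAOL.Chr.C2,"
    ">C1 dna:chromosome chromosome:v2.1:C1:1:43764888:1 REF,"
    ">C2 dna:chromosome chromosome:v2.1:C2:1:52886895:1 REF,"
    "BRAOL.Chr.C1.fasta-BRAOL.Chr.C2.fasta.mat.filt.png,C1,C2, 0.996,43764888,52886895\n"
)
for row in read_index(io.StringIO(example)):
    print(row["SpX"], row["SpY"], row["Score"])
```

With the rows parsed, sorting by Score is a quick way to list the most similar pairs first.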

Notice that you can easily run this in parallel by just re-issuing the command (i.e. execute same command as many times as you want, each time another core will help in the processing).

Converting CHROMEISTER signal into alignments

First of all, consider whether it is interesting or not to use CHROMEISTER for "fine-grained" results. CHROMEISTER is recommended for VERY coarse-grained and full-genome comparisons in order to quickly assess similarity between genomes. Thus it does NOT produce alignments. However, if you find yourself in a situation where you want to convert the signal of CHROMEISTER into alignments (e.g. two large genomes), this can be done. The following tutorial shows how to do it, with human chromosome X and mouse chromosome X as example:

  1. First, run CHROMEISTER like this:

    ./CHROMEISTER -query HOMSA.Chr.X.fasta -db MUSMU.Chr.X.fasta -out dotplot.mat -dimension 1000 && Rscript compute_score.R dotplot.mat 1000

  2. Check the "dotplot.mat.filt.png" corresponding to the dotplot between both chromosomes to see if there is any similarity. If so, proceed to next step.

  3. Clone the following repository: https://github.com/estebanpw/gecko

    git clone https://github.com/estebanpw/gecko

  4. Switch branch to the one named "inmemory_guided_chrom" and compile it. To do so, issue the following command:

    cd gecko && git checkout inmemory_guided_chrom && make all -C src

  5. Now run the script "guidefastas" in the bin folder. See below:

    bin/guidefastas.sh HOMSA.Chr.X.fasta MUSMU.Chr.X.fasta hits-XY-dotplot.mat.hits 1000 100 60 32

    You can add the following arguments to the execution:

  • --alignments : This one will extract the alignments and write them to a file with extension .alignments in the all-results folder.

  • --names : This one will output the names of the sequences to which each fragment belongs instead of their sequence number (e.g. if comparing chromosomes 1 and 2 of Homo sapiens vs Mus musculus then 0,1 will now be Homo s. chr1, Mus m. chr2)

  • --sort : This will sort output frags in the csv file according first to the comparison they belong to and secondly by coordinates.

  • --local : This one will convert the coordinates in the csv file (which are global in respect to the file) to local in respect to each sequence in the fasta file. Useful for chromosome comparison when we do not want all coordinates to accumulate.

    Note (1): remember to include the full path to the sequences.

    Note (2): the "hits-XY-dotplot.mat.hits" file is produced by CHROMEISTER in step 1. Copy it to the folder or include its full path.

    Note (3): the parameters at the end of the command ("1000 100 60 32") are, in order: (1) size of the dotplot, (2) minimum length that an alignment must have to be reported, (3) minimum similarity from 0-100, (4) k-mer seed size (use 32 for chromosome-like sequences).

    This step can take several minutes, e.g. using 1 CPU this execution took around 9-10 minutes.

  6. A CSV file containing the alignments coordinates can be found in the folder all-results. You can download it here if you wish to do so: http://mango.ac.uma.es/compartir/HOMSA_X-MUSMU_X.csv

  7. If you also wish to visually contrast annotations to the alignments, you can use our genomic browser at https://pistacho.ac.uma.es/. To do so just follow the user guide available at https://pistacho.ac.uma.es/static/data/GeckoMGV-UserGuide.pdf

  8. An example alignment file is shown below:

AAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAAAGAAAGAAAGAAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAGAAAGAAAA
||||||||||||||||||||||||||||||||||| | |||  ||  || ||| ||| |||||||||||||||||||||||||||||||||| | | | ||
AAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAAAAGAAAGAAGGAAGGAAGGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAA
@ FORWARD STRAND x1: 10501385 y1: 169863509 x2: 10501485 y2: 169863609 Identity: 88/101 (87.1287%)
TTTTCCCATTGATTAATATTTTTCCTGTTGAGCAGATGAGAGAAAGCCAAAAAAAGCACAGCTGGGCCATTTCCCCTCACTGGGAACGTCATTTCCAGGCACTTTGTGCTTACTTGAT
|||||||||||||| ||||||||| | |||| | |||||||||||||||||||||||||||||||||||||||| |||||||  |||||||||||||| |||| |||||| |||||||
TTTTCCCATTGATTGATATTTTTCTTATTGAACTGATGAGAGAAAGCCAAAAAAAGCACAGCTGGGCCATTTCCTCTCACTGTAAACGTCATTTCCAGTCACTCTGTGCTCACTTGAT
@ REVERSE STRAND x1: 10451224 y1: 169981322 x2: 10451341 y2: 169981205 Identity: 107/118 (90.678%)

Note that each fragment begins with an alignment, and has a corresponding @ line where information is included such as strand of the alignments, coordinates, identity, etc.
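The @ lines can be turned into records with a small parser. The sketch below is based only on the two example lines shown above, so other variants of the @ line (if any exist) would not be matched; the function name parse_at_line is hypothetical.

```python
import re

# Pattern derived from the two example '@' lines above (assumption:
# all '@' lines in an .alignments file follow this layout).
AT_LINE = re.compile(
    r"^@ (?P<strand>FORWARD|REVERSE) STRAND "
    r"x1: (?P<x1>\d+) y1: (?P<y1>\d+) x2: (?P<x2>\d+) y2: (?P<y2>\d+) "
    r"Identity: (?P<matches>\d+)/(?P<length>\d+)"
)


def parse_at_line(line):
    """Return a dict with strand, coordinates and identity, or None if no match."""
    match = AT_LINE.match(line.strip())
    if match is None:
        return None
    record = {key: int(value) for key, value in match.groupdict().items()
              if key != "strand"}
    record["strand"] = match.group("strand")
    record["identity"] = record["matches"] / record["length"]
    return record


line = ("@ REVERSE STRAND x1: 10451224 y1: 169981322 "
        "x2: 10451341 y2: 169981205 Identity: 107/118 (90.678%)")
print(parse_at_line(line))
```

Note that in the reverse-strand example y2 is smaller than y1, so code that computes lengths on the y axis should use the absolute difference.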

Parameters

USAGE:

  • -query: sequence A in fasta format
  • -db: sequence B in fasta format
  • -out: output matrix
  • -kmer Integer: k>1 (default 32) Use 32 for chromosomes and genomes and 16 for small bacteria
  • -diffuse Integer: z>0 (default 4) Use 4 for everything - if using large plant genomes you can try using 1
  • -dimension Size of the output matrix and plot. Integer: d>0 (default 1000) Use 1000 for everything that is not full genome size, where 2000 is recommended
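When calling CHROMEISTER from a pipeline, it can help to build the command line in one place and check the parameter ranges listed above. The sketch below only assembles the argument list from the flags documented here; the function name and the binary argument are hypothetical, and nothing is executed.

```python
def chromeister_command(query, db, out, kmer=32, diffuse=4, dimension=1000,
                        binary="CHROMEISTER"):
    """Build the CHROMEISTER argument list using the documented flags."""
    if kmer <= 1:
        raise ValueError("-kmer must be greater than 1")
    if diffuse <= 0:
        raise ValueError("-diffuse must be greater than 0")
    if dimension <= 0:
        raise ValueError("-dimension must be greater than 0")
    return [binary,
            "-query", query,
            "-db", db,
            "-out", out,
            "-kmer", str(kmer),
            "-diffuse", str(diffuse),
            "-dimension", str(dimension)]


cmd = chromeister_command("seqX", "seqY", "dotplot.mat", dimension=2000)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True), after which the score script should be called with the same dimension value.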

Test data

You can test CHROMEISTER with the two mycoplasma sequences provided in the 'test-data' folder. You can do so by running the following commands (from within the test-data folder):

../bin/CHROMEISTER -query mycoplasma-232.fasta -db mycoplasma-7422.fasta -out mycoplasma-232-7422.mat -dimension 500
Rscript ../bin/compute_score.R mycoplasma-232-7422.mat 500

Note: in this example we used size 500 since the two sequences are quite small.

Example runs

Chromosome example

Comparing two chromosomes (Homo sapiens Chr X vs Mus musculus Chr X) in a minute:

run_and_plot_chromeister.sh HOMSA.Chr.X.fasta MUSMU.Chr.X.fasta 32 1000 4

HOMSA vs MUSMU

Multi-fasta

Comparing the full genome of the Gallus gallus against the Meleagris gallopavo with all their chromosomes including auto-generated grid:

run_and_plot_chromeister.sh GALGA.Chr.complete.fasta MELGA.Chr.complete.fasta 32 1000 4 grid

MULTI-FASTA

Events detection

From the Homo sapiens Chr X vs Mus musculus Chr X comparison you can inspect the file HOMSA.Chr.X.fasta-MUSMU.Chr.X.fasta.mat.events.txt which contains the classified events. You can also plot the overlap between detected events and original plot by doing:

python3 chromeister/bin/detect_events.py HOMSA.Chr.X.fasta-MUSMU.Chr.X.fasta.mat.raw.txt png

which will generate the following plot:

DETECTED-EVENTS

Notice that the orange color represents detected events whereas green represents undetected blocks. The detection is based on the Hough transform, and therefore there are some parameters that can be used to change the minimum length to detect an event (e.g. if we want single points as well) or the minimum gap to join two blocks. These parameters can be changed in the detect_events.py file. The current configuration is mostly tailored to detect larger events and no single points.
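To illustrate the idea behind a Hough transform on a dotplot, the following is a minimal pure-Python sketch. It is not the code in detect_events.py: every point votes for all lines rho = x*cos(theta) + y*sin(theta) that pass through it, and the line with the most votes is reported. The function name and the one-degree angle step are assumptions made for this example.

```python
import math
from collections import Counter


def hough_peak(points, angle_step_deg=1):
    """Return ((theta_deg, rho), votes) for the line supported by most points."""
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, angle_step_deg):
            theta = math.radians(theta_deg)
            # Distance of the candidate line to the origin, rounded to a bin.
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    return votes.most_common(1)[0]


# Toy signals on a 100x100 matrix: one along the main diagonal and
# one along the anti-diagonal.
diagonal = [(i, i) for i in range(100)]
anti_diagonal = [(i, 99 - i) for i in range(100)]
print(hough_peak(diagonal))
print(hough_peak(anti_diagonal))
```

In a sketch like this, a minimum number of votes would play the role of the minimum event length: lines supported by only a few points are discarded.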

Fine-grained run

Comparing some Mycoplasma hyopneumoniae genomes (not everything has to be mammalian or plant chromosomes!).

SMALL-BACTERIA

This is afterwards run with the inmemory_guided_chrom branch of GECKO (see the section "Converting CHROMEISTER signal into alignments") using the following command line (supply the CHROMEISTER hits file):

gecko/bin/guidefastas.sh NC_014.fasta NC_017.fasta hits-XY-NC_014.fasta-NC_017.fasta.mat.hits 1000 50 60 32 names

and we get a csv NC_014-NC_017.csv which includes:

Frag,1218913,3918445,1221077,3916281,r,0,2165,8644,2163,99.82,1.00,_gi|304372805|ref|NC_014448.1|,_gi|385858114|ref|NC_017519.1|
[...]
Frag,6618091,5316746,6620403,5319058,f,0,2313,8524,2222,92.13,0.96,_gi|321309518|ref|NC_014970.1|,_gi|385858893|ref|NC_017520.1|
[...]
Frag,3062371,6264699,3064642,6266970,f,0,2272,9080,2271,99.91,1.00,_gi|313664890|ref|NC_014751.1|,_gi|392388518|ref|NC_017521.1|

Where the last two columns indicate the origin of the fragment. For instance, in this case, the first fragment belongs to gi|304372805|ref|NC_014448.1| and gi|385858114|ref|NC_017519.1| which are sequences 2 and 5 in the x and y axis in the previous plot, respectively.
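The Frag rows can be read with a short parser such as the sketch below. Only the meaning of the last two columns is stated above; the other column names used here (coordinates, strand, length and so on) are guesses based on the example values and should be checked against the GECKO documentation. The function name read_frags is hypothetical.

```python
import csv
import io

# Assumed column names; only seq_x and seq_y are confirmed by the text above.
FRAG_COLUMNS = ["type", "x_start", "y_start", "x_end", "y_end", "strand",
                "block", "length", "score", "identities", "similarity",
                "identity_ratio", "seq_x", "seq_y"]


def read_frags(handle):
    """Yield one dict per 'Frag' row, skipping any other line."""
    for fields in csv.reader(handle):
        if len(fields) != len(FRAG_COLUMNS) or fields[0] != "Frag":
            continue
        frag = dict(zip(FRAG_COLUMNS, fields))
        for key in ("x_start", "y_start", "x_end", "y_end", "length"):
            frag[key] = int(frag[key])
        yield frag


example = (
    "Frag,1218913,3918445,1221077,3916281,r,0,2165,8644,2163,99.82,1.00,"
    "_gi|304372805|ref|NC_014448.1|,_gi|385858114|ref|NC_017519.1|\n"
    "[...]\n"
)
for frag in read_frags(io.StringIO(example)):
    print(frag["seq_x"], frag["seq_y"], frag["strand"], frag["length"])
```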

Remember that your csv file can be uploaded and interactively inspected here!

Help

  1. Hanging output and program does not finish. If you experience this kind of output:

    [INFO] Generating a 1000x1000 matrix

    [INFO] Loading database

    100%...[INFO] Database loaded and of length 70039485.

    [INFO] Ratios: Q [6.658880e+04] D [7.003949e+04]. Lenghts: Q [66588797] D [70039485]

    [INFO] Pixel size: Q [1.501754e-05] D [1.427766e-05].

    [INFO] Computing absolute hit numbers.

    100%...Scanning hits table.

    100%...

    [INFO] Query length 66588797.

    [INFO] Writing matrix.

    [INFO] Found 238693 unique hits for z = 4.

    But the program does not finish (it "hangs"), then your input probably contains a lot of sequences (e.g. a multifasta with hundreds of contigs). To fix this, simply run the same command but use the "compute_score-nogrid.R" script instead of "compute_score.R". This skips drawing the grid, which can become overloaded when there are too many sequences.
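A quick way to anticipate this problem is to count the sequences in the input files before choosing the score script. The sketch below counts FASTA header lines; the threshold of 100 sequences is an arbitrary assumption for illustration, not a value taken from CHROMEISTER, and the function names are hypothetical.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting lines starting with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def pick_score_script(n_records, max_grid_sequences=100):
    """Choose the grid or no-grid script (threshold is a hypothetical default)."""
    if n_records <= max_grid_sequences:
        return "compute_score.R"
    return "compute_score-nogrid.R"


# Example usage, assuming query.fasta and db.fasta exist:
# total = count_fasta_records("query.fasta") + count_fasta_records("db.fasta")
# print(pick_score_script(total))
```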

Citing

If you use or have used CHROMEISTER in your research, please cite the following article:

Esteban Pérez-Wohlfeil, Sergio Diaz-del-Pino and Oswaldo Trelles. "Ultra-fast genome comparison for large-scale genomic experiments." Scientific reports 9, no. 1 (2019): 1-10.



chromeister's Issues

Not finding any synteny in test data

Hi,

I was running CHROMEISTER on a plant genome against itself (~1 Gbp) where I expected to see some synteny between chromosomes, but did not see any. I then tested CHROMEISTER on a chloroplast genome (which has two inverted repeat regions) and again saw nothing. I then decided to copy parts of the cp chromosome into test contigs - both forward and reverse complement of ~5kb. This too showed no synteny, just a dot plot with three gaps (fig1 below). What is the minimum size of a block where you would expect to see synteny (I would have thought 5kb would be large enough?).

I then repeated the test with the whole test sequence reverse complemented and added as another chromosome:
CHROMEISTER -dimension 2000 -kmer 16 -query cptest2.fasta -db cptest2.fasta -out cpdotplot.mat && Rscript /g/data/nm31/bin/chromeister/bin/compute_score.R cpdotplot.mat 2000

[INFO] Generating a 1000x1000 matrix
[INFO] Loading database
100%...[INFO] Database loaded and of length 374340.
[INFO] Ratios: Q [3.743400e+02] D [3.743400e+02]. Lenghts: Q [374340] D [374340]
[INFO] Pixel size: Q [2.671368e-03] D [2.671368e-03].
[INFO] Computing absolute hit numbers.
100%...Scanning hits table.
99%...
[INFO] Query length 374340.
[INFO] Writing matrix.
[INFO] Found 0 unique hits for z = 4.
0.996

..and this produced a 'grey' dotplot, fig2.

fig1
fig2

test data:

cptest2.fasta.txt

So something is not working but I am not getting any errors.

Thanks,

Theo

coordinates from the events file

Hi Esteban,

I'm interested in pulling the specific chromosomal coordinates identified in the events file after running run_and_plot_chromeister.sh. I believe they are reported globally across the genome in this file (the genome has 21 chromosomes in the fasta), and the file does not specify the chromosome from which they are derived. What is the best way to convert global coordinates to local chromosomal coordinates for the events identified?

Thanks in advance,

Jeff

Bad Install due opencv-python

Hi There,

I followed the install instructions and I encountered the following error on Ubuntu 22.04.1 LTS:

After this:

cd chromeister && make all -C src/ && python3 -m venv chromeisterenv && source chromeisterenv/bin/activate && pip install -r src/requirements.txt

I have the following error:

  /tmp/pip-build-env-uj14l2ll/overlay/lib/python3.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:242:9: error: 'PyTuple_GET_SIZE' was not declared in this scope
       if (PyTuple_GET_SIZE(value) != 3) {
           ^~~~~~~~~~~~~~~~
  /tmp/pip-build-env-uj14l2ll/overlay/lib/python3.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:242:9: note: suggested alternative: 'PyTuple_GetSlice'
       if (PyTuple_GET_SIZE(value) != 3) {
           ^~~~~~~~~~~~~~~~
           PyTuple_GetSlice
  /tmp/pip-build-env-uj14l2ll/overlay/lib/python3.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:245:13: error: 'PyTuple_GET_ITEM' was not declared in this scope
       title = PyTuple_GET_ITEM(value, 2);
               ^~~~~~~~~~~~~~~~
  /tmp/pip-build-env-uj14l2ll/overlay/lib/python3.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:245:13: note: suggested alternative: 'PyArray_GETITEM'
       title = PyTuple_GET_ITEM(value, 2);
               ^~~~~~~~~~~~~~~~
               PyArray_GETITEM
  gmake[2]: *** [modules/python3/CMakeFiles/opencv_python3.dir/build.make:76: modules/python3/CMakeFiles/opencv_python3.dir/__/src2/cv2.cpp.o] Error 1
  gmake[1]: *** [CMakeFiles/Makefile2:2077: modules/python3/CMakeFiles/opencv_python3.dir/all] Error 2
  gmake: *** [Makefile:166: all] Error 2
  Traceback (most recent call last):
    File "/tmp/pip-build-env-uj14l2ll/overlay/lib/python3.7/site-packages/skbuild/setuptools_wrap.py", line 640, in setup
      cmkr.make(make_args, install_target=cmake_install_target, env=env)
    File "/tmp/pip-build-env-uj14l2ll/overlay/lib/python3.7/site-packages/skbuild/cmaker.py", line 672, in make
      self.make_impl(clargs=clargs, config=config, source_dir=source_dir, install_target=install_target, env=env)
    File "/tmp/pip-build-env-uj14l2ll/overlay/lib/python3.7/site-packages/skbuild/cmaker.py", line 704, in make_impl
      "An error occurred while building with CMake.\n"

  An error occurred while building with CMake.
    Command:
      cmake --build . --target install --config Release --
    Install target:
      install
    Source directory:
      /tmp/pip-install-pjy3pvde/opencv-python
    Working directory:
      /tmp/pip-install-pjy3pvde/opencv-python/_skbuild/linux-x86_64-3.7/cmake-build
  Please check the install target is valid and see CMake's output for more information.
  ----------------------------------------
  ERROR: Failed building wheel for opencv-python
  Running setup.py clean for opencv-python
Failed to build opencv-python
ERROR: Could not build wheels for opencv-python which use PEP 517 and cannot be installed directly

I checked the requirements for the program and worked around the problem with the following conda recipe for Ubuntu 22.04:

cd --
git clone https://github.com/estebanpw/chromeister.git
cd chromeister
make all -C src/

conda create --name chromeister python=3.6.13         # create environment
conda activate chromeister                            # load environment

# Install packages
conda install -c bioconda -y scikit-build
conda install -c anaconda -y cycler
conda install -c anaconda -y kiwisolver
conda install -c bioconda -y numpy
pip install opencv-python
conda install -c bioconda -y Pillow
conda install -c bioconda -y pyparsing
conda install -c bioconda -y python-dateutil
conda install -c bioconda -y six

In this manner, as an example, I can call chromeister as follows:

conda activate chromeister
/home/carlos/chromeister/bin/CHROMEISTER -query GRCm39.genome.fa -db Mus_musculus.GRCm39.dna.toplevel.fa -out dotplot.mat && Rscript /home/carlos/chromeister/bin/compute_score.R dotplot.mat 1000

and the program is working as follows:

[INFO] Generating a 1000x1000 matrix
[INFO] Loading database
99%...[INFO] Database loaded and of length 2728222451.
[INFO] Ratios: Q [2.728222e+06] D [2.728222e+06]. Lenghts: Q [2728222451] D [2728222451]
[INFO] Pixel size: Q [3.665390e-07] D [3.665390e-07].
[INFO] Computing absolute hit numbers.
19%...
79%...
99%...Scanning hits table.
99%...
[INFO] Query length 2728222451.
[INFO] Writing matrix.
[INFO] Found 2234272 unique hits for z = 4.
0

My question is: can I just use the program in this way, or do I need to work on the python-opencv error?

Cheers,

Carlos

Sorting the plot from chromeister

Hi
Is it possible to sort the chromeister plot according to the reference coordinates? I found a thread discussing the same using Mauve but I am dealing with a huge genome and It doesn't work.

Thanks

add parameter to grid drawing

If there are too many sequences in a multifasta (e.g. contigs or scaffolds) the grid will completely eclipse the dotplot.

Error in axis - no locations are finite

Hello,

I ran Chromeister and I get the following error. Could you please help me? Thanks!

./CHROMEISTER -query query.fa -db ref.fa -out dotplot.mat && Rscript compute_score.R dotplot.mat 1000
[INFO] Generating a 1000x1000 matrix
[INFO] Loading database
100%...[INFO] Database loaded and of length 557480303.
[INFO] Ratios: Q [6.709421e+05] D [5.574803e+05]. Lenghts: Q [670942087] D [557480303]
[INFO] Pixel size: Q [1.490442e-06] D [1.793785e-06].
[INFO] Computing absolute hit numbers.
100%...Scanning hits table.
100%...
[INFO] Query length 670942087.
[INFO] Writing matrix.
[INFO] Found 299180 unique hits for z = 4.
[1] 0.996
Error in axis(1, tick = FALSE, labels = seq_x_labels, at = seq_x_ticks) :
no locations are finite
Execution halted

parallel analysis

Hi There,
I am going to compare two large genome assemblies (each ~ 2.6 Gb). I am interested to use allVsAll.sh for this purpose but I have two concerns: 1) I did not separate chromosomes and there are two whole genomes in my genome directory. I wonder if this is a correct way to do such analysis. 2) I would like to use parallel analysis and you mentioned that it can be possible by re-issuing the command but I want to submit the job to server (slurm job), so I wonder how I can handle that.
Here is my command:
allVsAll.sh /chromeister/genomes/ fasta dim 2000 kmer 32
Thank you for your support.
Karim

Breakpoint about inversion

Hello Esteban,

Many thanks for developing such useful tools.

1). I would like to use chromeister to check inversions and their breakpoints (first figure). Can chromeister highlight or report the breakpoints of an inversion in the png?
2). When I use chromeister to check a small inversion event, the figure does not look like points (second figure) but like lines. Is this because of the k-mer size?

 Thanks.

inv breakpoint

image

best,
Lin

error of removal of "index-refseq-qryseq.csv"

In addition, there is another small issue: the removal of "index-refseq-qryseq.csv" fails, which causes an error.
It seems to be caused by a directory/path issue.

error information:
Launching... /home/xuewen/Downloads/chromeister-master/bin/index_chromeister_solo.sh . index-refseq-qryseq.csv /home/xuewen/Downloads/chromeister-master/refseq /home/xuewen/Downloads/chromeister-master/qryseq
rm: cannot remove 'index-refseq-qryseq.csv': No such file or directory

Get plots in pdf or svg formats?

Hello Esteban,

I'm having a hard time trying to modify the R script to get vectorized plots that I can post-edit. I'm getting over-dimensioned plots and I have no idea what is going on, although I think the code is the same apart from the changed format. Could you help me?

Thank you in advance!

Comparison of large genomes.. messy plot

Dear Esteban,

I used chromeister tool to compare two varieties of olive plants (Each one was over 3 Gbp ) using the command
"./CHROMEISTER -query Genome1.fa -db Genome2.fa -out dotplot.mat -diffuse 1 && Rscript compute_score-nogrid.R dotplot.mat 1000"
Although I got the percentage of similarity, the plot was quite messy.

Then I ran
/CHROMEISTER -query Genome1.fa -db Genome2.fa -out dotplot.mat -diffuse 1 -dimension 2000 && Rscript compute_score-nogrid.R dotplot.mat 2000
but the plot was messy again...

The genomes that I compare have hundreds of contigs, and thus the figure looks messy. Could you provide some instructions on tuning the parameters to generate a better plot ?

Thanks a lot,
Mary

compute_score.r error

I installed the chromeister using conda.

I run the command
CHROMEISTER -query GENOME1.fasta -db GENOME2.fasta -out dotplot.mat -dimension 2000 && Rscript compute_score.R dotplot.mat 2000
but at the end I got the following error:
Fatal error: cannot open file 'compute_score.R': No such file or directory

I will be thankful for help.

get the coordinate of synteny block

Hi,

I'm working on the comparison of whole genome sequence for different maize lines.
The genome is large, ~2 GB per line, and Mummer took a long time for each pairwise comparison. I found that chromeister is very fast and I want to apply it to all my comparisons.
I need an output format similar to the output of show-coords from Mummer. The output should include the start and end positions on the query and target. However, using gecko on the chromeister output I only got coordinates matching the X-Y axes of the plot.
Could you help me figure out how to get the coordinates based on each chromosome?

Best,
Jing

Problem with run_and_plot_chromeister.sh

Hi,

I am trying to use chromeister for comparing genomes. All these genomes have ~1900 scaffolds. For the first pair, the run_and_plot_chromeister.sh script worked perfectly well and I got really nice plots at both 1000 and 2000 dims.
Then for the next pair it is giving me errors. Here is the log:
/apps/chromeister/1.5a/chromeister/bin/run_and_plot_chromeister.sh: line 31: /apps/chromeister/1.5a/chromeister/bin/../chromeisterenv/bin/activate: No such $
Traceback (most recent call last):
File "/apps/chromeister/1.5a/chromeister/bin/detect_events.py", line 15, in
readheader = open(sys.argv[1], "r")
FileNotFoundError: [Errno 2] No such file or directory: 'EW18.fasta-GM1.fasta.mat.raw.txt'
/apps/chromeister/1.5a/chromeister/bin/run_and_plot_chromeister.sh: line 33: deactivate: command not found
Mon Jun 27 03:38:52 EDT 2022

Command in the slurm script is:
module load chromeister/1.5a
module load R/4.2

run_and_plot_chromeister.sh EW18.fasta GM1.fasta 32 1000 4 grid

date

error cannot open file 'dotplot.mat.csv': No such file or directory

$scriptdir/CHROMEISTER -query aradu.K30065.gnm1.chr01.fa -db aradu.K30060.gnm1.chr01.fa -out dotplot.mat -dimension 2000

output: 7.7M
dotplot.mat

error:
$ Rscript $scriptdir/compute_score.R dotplot.mat 2000
Error in file(con, "r") : cannot open the connection
Calls: unlist -> readLines -> file
In addition: Warning message:
In file(con, "r") :
cannot open file 'dotplot.mat.csv': No such file or directory
Execution halted

##it seems that the output "dotplot.mat.csv " is missing

Interpretation of scores

Hi there,
Thank you for developing chromeister. I have used the software for a series of genome comparisons. Firstly, I compared two versions of genome assemblies from the same species. One of them is highly fragmented (~7000 scaffolds) and the other one is a chromosome-level assembly with 16 large scaffolds (chr). I expected to see a small score for this comparison, but the score value was 0.73. As you mentioned in your manuscript, scores close to 0 indicate the exact same sequences and 1 indicates absolutely no similarity (if I am right). I wonder if this could be the result of fragmentation in the first genome and if you have any suggestion to improve that. I have also compared the genome assembly of this species (second one) with two other related species and the scores were 0.1 and 0.26, which were reasonable. Both of these species have chromosome-level assemblies. I would appreciate it if you could guide me in this regard.
Best,
Karim

Whole genome comparisons.

Hi.
Can chromeister be used for comparing the whole genomes of two mammals, such as human and chimp?
If we need to specify or modify certain parameters, please mention those as well.

thank you.

mohit
