hillerlab / make_lastz_chains Goto Github PK

View Code? Open in Web Editor NEW

36.0 36.0 8.0 1.79 MB

Portable solution to generate genome alignment chains using lastz

License: MIT License

Python 87.33% Perl 12.07% Nextflow 0.60%

bioinformatics bioinformatics-pipeline genomics

make_lastz_chains's People

Contributors

Stargazers

Watchers

Forkers

yuzhenpeng wangchengww tkastylevsky osipovarev shjenkins94 ohdongha maher963 ning-y

make_lastz_chains's Issues

Runs fail at doChainRun step,

I was running alignments for a pair of Drosophila genomes and human vs. mouse. Both runs failed at the doChainRun step.

Commands used:

# for Drosophila pairs
make_chains.py Dme Dsi GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna GCF_016746395.2_Prin_Dsim_3.1_genomic.fna --executor_queuesize 16 --project_dir 220722_LastZ_chain_DmeDsi_1stAttempt > 220722_lsz_DmeDsi_1stAttempt.log 2>&1 
# for human vs. mouse
make_chains.py h38 m39 GCF_000001405.40_GRCh38.p14_genomic.fna GCF_000001635.27_GRCm39_genomic.fna --executor_queuesize 32 --project_dir 220722_LastZ_chain_h38m39_1stAttempt > 220722_lsz_h38m39_1stAttempt.log 2>&1 #2

Attached are log files for both runs.
220722_lsz_h38m39_1stAttempt.log
220722_lsz_DmeDsi_1stAttempt.log

The complaints were that a certain chrom couldn't be found in .chrom.sizes files:

# for Drosophila pairs 
ERROR: file /export/home/TMP/771717715.1.unified/tmp.oHfdzTtayi/NW_001845981.psl was not bundled as the chrom could not be found in ./220722_LastZ_chain_DmeDsi_1stAttempt/Dme.chrom.sizes
# for human vs. mouse
ERROR: file /export/home/TMP/771851000.1.unified/tmp.apDCgOGofk/NC_000015.psl was not bundled as the chrom could not be found in ./220722_LastZ_chain_h38m39_1stAttempt/h38.chrom.sizes

But I can see both of them (with .* suffixes) in *.chrom.sizes.

$ grep "NW_001845981" 220722_LastZ_chain_DmeDsi_1stAttempt/Dme.chrom.sizes
NW_001845981.1  1643
$ grep "NC_000015" 220722_LastZ_chain_h38m39_1stAttempt/h38.chrom.sizes
NC_000015.10    101991189

Will having .* suffixes in sequence names be a problem? Please help.
Thanks!

Renaming sequence in the chain has an issue.

Hi again,
I saw the update to deal with the sequence IDs with "." (per issue #3):

Warning If your scaffold names are numbered, such as NC_00000.1, please consider removing scaffold numbers (rename NC_00000.1 to NC_00000 or NC_00000__1, for example). ... The pipeline will try to trim scaffold numbers automatically to process the data correctly. Afterward, it will rename the scaffolds back.

Since it appears that you implemented a mechanism to deal with the ".*" in sequence IDs, I downloaded the new version of the pipeline and ran it with the input genomes without modifying sequence IDs.

The pipeline started with creating a renamed .fasta file and a .tsv that contains the sequence ID conversion table. And at the end, the log says the pipeline is renaming chromosome names in the chain file.

However, the resulting chain file does not look right. Most of all, the chain score field is gone, making it impossible to salvage the results:

$ zgrep "chain" h38w5.m39w5.allfilled.chain.gz | head -n5 | column -t
chain  NC_000014.9   NC_000014  107043718  +  24687985   NC_000078.7  NC_000078  120092757  +  44683557   114519917  1
chain  NC_000004.12  NC_000004  190214555  +  1103677    NC_000071.7  NC_000071  151758149  +  33338229   105126044  2
chain  NC_000002.12  NC_000002  242193529  +  96496148   NC_000067.7  NC_000067  195154279  +  36305135   93990572   3
chain  NC_000001.11  NC_000001  248956422  +  930892     NC_000070.7  NC_000070  156860686  -  516431     53690008   4
chain  NC_000001.11  NC_000001  248956422  +  68121445   NC_000069.7  NC_000069  159745316  -  100601     72842298   5

For comparison, below is the chain file using a previous version of make_lastz_chains. Here, I supplied the input genomes with sequence IDs pre-modified:

$ zgrep "chain" h38w4.m39w4.all.chain.gz | head -n5 | column -t 
chain  720570302  NC_000014__9   107043718  +  24687985   106824309  NC_000078__7  120092757  +  44683557   114419863  1
chain  598452391  NC_000004__12  190214555  +  1104636    88140396   NC_000071__7  151758149  +  33339160   105126044  2
chain  584782083  NC_000001__11  248956422  +  930892     58547094   NC_000070__7  156860686  -  516431     53690008   3
chain  578674369  NC_000002__12  242193529  +  96496148   241869966  NC_000067__7  195154279  +  36305135   93989195   4
chain  564776354  NC_000001__11  248956422  +  68121445   158184951  NC_000069__7  159745316  -  100601     72842298   5

Is there any other part of the pipeline updated since July 6th, 2022? If not, we will revert back to the older version (make_lastz_chains) and use genome files with the modified sequence IDs. The current version has an issue.

Thanks,
Dong-Ha

Error running make_lastz_chains on Slurm cluster

Hi,

I am attaching an error am getting while running make_lastz_chains on Slurm cluster. Can you please help me to fix the issue.

Thanks
Philge
make_lastz_error.txt

make_chains.py in pipeline: modules.error_classes.PipelineFileNotFoundError: Error! No non-empty files found at make_lastz_chains/test_out/temp_concat_lastz_output. The failed operation label is: cat_step

On RHEL8, nextflow 23.10.0, anaconda/3-2023.09 on a shared HPC cluster, running the below example from Running the pipeline, results in the below. Did I miss a step?

./make_chains.py target query test_data/test_reference.fa test_data/test_query.fa --pd test_out -f --chaining_memory 16
# Make Lastz Chains #
Version 2.0.8
Commit: 187e313afc10382fe44c96e47f27c4466d63e114
Branch: main

* found run_lastz.py at /path/to/me/make_lastz_chains/standalone_scripts/run_lastz.py
* found run_lastz_intermediate_layer.py at /path/to/me/make_lastz_chains/standalone_scripts/run_lastz_intermediate_layer.py
* found chain_gap_filler.py at /path/to/me/make_lastz_chains/standalone_scripts/chain_gap_filler.py
* found faToTwoBit at /path/to/me/make_lastz_chains/HL_kent_binaries/faToTwoBit
* found twoBitToFa at /path/to/me/make_lastz_chains/HL_kent_binaries/twoBitToFa
* found pslSortAcc at /path/to/me/make_lastz_chains/HL_kent_binaries/pslSortAcc
* found axtChain at /path/to/me/make_lastz_chains/HL_kent_binaries/axtChain
* found axtToPsl at /path/to/me/make_lastz_chains/HL_kent_binaries/axtToPsl
* found chainAntiRepeat at /path/to/me/make_lastz_chains/HL_kent_binaries/chainAntiRepeat
* found chainMergeSort at /path/to/me/make_lastz_chains/HL_kent_binaries/chainMergeSort
* found chainCleaner at /path/to/me/make_lastz_chains/HL_kent_binaries/chainCleaner
* found chainSort at /path/to/me/make_lastz_chains/HL_kent_binaries/chainSort
* found chainScore at /path/to/me/make_lastz_chains/HL_kent_binaries/chainScore
* found chainNet at /path/to/me/make_lastz_chains/HL_kent_binaries/chainNet
* found chainFilter at /path/to/me/make_lastz_chains/HL_kent_binaries/chainFilter
* found lastz at /path/to/me/make_lastz_chains/HL_kent_binaries/lastz
* found nextflow at /burg/opt/nextflow/23.10.0/nextflow
All necessary executables found.
Making chains for test_data/test_reference.fa and test_data/test_query.fa files, saving results to /path/to/me/make_lastz_chains/test_out
Pipeline started at 2023-12-14 10:35:02.145466
* Setting up genome sequences for target
genomeID: target
input sequence file: test_data/test_reference.fa
is 2bit: False
planned genome dir location: /path/to/me/make_lastz_chains/test_out/target.2bit
Initial fasta file test_data/test_reference.fa saved to /path/to/me/make_lastz_chains/test_out/target.2bit
For target (target) sequence file: /path/to/me/make_lastz_chains/test_out/target.2bit; chrom sizes saved to: /path/to/me/make_lastz_chains/test_out/target.chrom.sizes
* Setting up genome sequences for query
genomeID: query
input sequence file: test_data/test_query.fa
is 2bit: False
planned genome dir location: /path/to/me/make_lastz_chains/test_out/query.2bit
Initial fasta file test_data/test_query.fa saved to /path/to/me/make_lastz_chains/test_out/query.2bit
For query (query) sequence file: /path/to/me/make_lastz_chains/test_out/query.2bit; chrom sizes saved to: /path/to/me/make_lastz_chains/test_out/query.chrom.sizes

### Partition Step ###

# Partitioning for target
Saving partitions and creating 1 buckets for lastz output
In particular, 0 partitions for bigger chromosomes
And 1 buckets for smaller scaffolds
Saving target partitions to: /path/to/me/make_lastz_chains/test_out/target_partitions.txt
# Partitioning for query
Saving partitions and creating 1 buckets for lastz output
In particular, 0 partitions for bigger chromosomes
And 1 buckets for smaller scaffolds
Saving query partitions to: /path/to/me/make_lastz_chains/test_out/query_partitions.txt
Num. target partitions: 0
Num. query partitions: 0
Num. lastz jobs: 0

### Lastz Alignment Step ###

LASTZ: making jobs
LASTZ: saved 1 jobs to /path/to/me/make_lastz_chains/test_out/temp_lastz_run/lastz_joblist.txt
Parallel manager: pushing job /burg/opt/nextflow/23.10.0/nextflow /path/to/me/make_lastz_chains/parallelization/execute_joblist.nf --joblist /path/to/me/make_lastz_chains/test_out/temp_lastz_run/lastz_joblist.txt -c /path/to/me/make_lastz_chains/test_out/temp_lastz_run/lastz_config.nf
N E X T F L O W  ~  version 23.10.0
Launching `/path/to/me/make_lastz_chains/parallelization/execute_joblist.nf` [boring_faggin] DSL2 - revision: 0483b29723
executor >  local (1)
[c0/1be141] process > execute_jobs (1) [100%] 1 of 1 ✔


### Nextflow process lastz finished successfully
Found 1 output files from the LASTZ step
Please note that lastz_step.py does not produce output in case LASTZ could not find any alignment

### Concatenating Lastz Results (Cat) Step ###

Concatenating LASTZ output from 1 buckets
* skip bucket bucket_ref_bulk_1: nothing to concat
An error occurred while executing cat: Error! No non-empty files found at /path/to/me/make_lastz_chains/test_out/temp_concat_lastz_output. The failed operation label is: cat_step
Traceback (most recent call last):
  File "/path/to/me/make_lastz_chains/modules/step_manager.py", line 70, in execute_steps
    step_result = step_to_function[step](params, project_paths, step_executables)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/me/make_lastz_chains/modules/pipeline_steps.py", line 58, in cat_step
    do_cat(params, project_paths, executables)
  File "/path/to/me/make_lastz_chains/steps_implementations/cat_step.py", line 51, in do_cat
    has_non_empty_file(project_paths.cat_out_dirname, "cat_step")
  File "/path/to/me/make_lastz_chains/modules/common.py", line 51, in has_non_empty_file
    raise PipelineFileNotFoundError(err_msg)
modules.error_classes.PipelineFileNotFoundError: Error! No non-empty files found at /path/to/me/make_lastz_chains/test_out/temp_concat_lastz_output. The failed operation label is: cat_step

ls -l test_out/temp_concat_lastz_output
total 0

Error in step 'fillChains'

Hi ,
I have been facing a problem with make_chains.py. When I align Target genome with query genome A (contigs 186, genome lenght 600Mb) worked perfectly fine. But thenwhen I want to align a query B (contigs 1500, 700 mb) against the same target, I always found an error in step fillChains. Species A and B are closely related. I was wondering what is wrong. I am attaching the .log file.
Thanks in advance,
Diego
make_chains.log

setup_genome_sequences fails if rerunning from the "cat" step

This is a low importance issue, because the use case is rare and the workaround is easy.

If make_lastz_chains is called with --continue_from_step cat, but the "cat" step has been run before, make_lastz_chains will fail with e.g.

### Trying to continue from step: cat
Making chains for /data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/200-genomes/hg38.2bit and /data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/210-repeatmasked/GCA_004027375.1_MacSob_v1_BIUU_genomic.2bit files, saving results to /data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/300-chain/hg38/GCA_004027375.1_MacSob_v1_BIUU_genomic/out
Pipeline started at 2024-02-28 12:56:25.148054
 * Setting up genome sequences for target
genomeID: hg38
input sequence file: /data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/200-genomes/hg38.2bit
is 2bit: True
planned genome dir location: /data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/300-chain/hg38/GCA_004027375.1_MacSob_v1_BIUU_genomic/out/target.2bit
Traceback (most recent call last):
  File "/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/100-install/make_lastz_chains/./make_chains.py", line 261, in <module>
    main()
  File "/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/100-install/make_lastz_chains/./make_chains.py", line 257, in main
    run_pipeline(args)
  File "/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/100-install/make_lastz_chains/./make_chains.py", line 233, in run_pipeline
    setup_genome_sequences(args.target_genome,
  File "/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/100-install/make_lastz_chains/modules/project_setup_procedures.py", line 172, in setup_genome_sequences
    os.symlink(arg_input_2bit, seq_dir)
FileExistsError: [Errno 17] File exists: '/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/200-genomes/hg38.2bit' -> '/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/300-chain/hg38/GCA_004027375.1_MacSob_v1_BIUU_genomic/out/target.2bit'

Because the "target.2bit" file was symlinked from the earlier run, and the os.symlink fails if the destination already exists.

The user workaround is to delete "target.2bit" which fixes this easily.

The developer fix is to check and remove the destination if it exists, or use a symbolic linking function which tolerates already existing destinations. Checking briefly, the os.symlink function does not have this option.

Multiple Alignments

Dear Authors,

thank you very much for providing the pipeline! I would like to run the pipeline on different genomes together to compute whole genome alignment of different species to use them afterwards in TOGA.
Assume we have 4 assemblies: hg38, mm10, xs1 and ml2. Can I apply something like this:

./make_chains.py hg38 mm10 xs1 ml2 /path/to/hg38.fasta /path/to/mm10.fasta /path/to/xs1.fasta /path/to/ml2.fasta

Best,
Ahmad

PBS jobs were submitted, but failed quickly

Dear Prof @MichaelHiller

Thank you for developing such great software!

I'm having some problems with make_lastz_chain, everything seems to be working fine at first,

************ HgStepManager: executing step 'lastz' Wed Aug 31 22:44:00 2022.
doLastzClusterRun ....
testCleanState current step: doLastzClusterRun, /gpfs/home/liunyw/howler_monkey/make_lastz_chain/homSap-macFas/TEMP_run.lastz, /gpfs/home/liunyw/howler_monkey/make_lastz_chain/homSap-macFas/TEMP_run.lastz/lastz.done     previous step: doPartition, /gpfs/home/liunyw/howler_monkey/make_lastz_chain/homSap-macFas/TEMP_run.lastz/partition.done
# chmod a+x /gpfs/home/liunyw/howler_monkey/make_lastz_chain/homSap-macFas/TEMP_run.lastz/doClusterRun.sh
# /gpfs/home/liunyw/howler_monkey/make_lastz_chain/homSap-macFas/TEMP_run.lastz/doClusterRun.sh
+ cd /gpfs/home/liunyw/howler_monkey/make_lastz_chain/homSap-macFas/TEMP_run.lastz
+ gensub2 homSap.lst macFas.lst gsub jobList
+ parallel_executor.py lastz_homSapmacFas jobList -q day --memoryMb 10000 -e pbs --co None -p mem3T --eq 10
N E X T F L O W  ~  version 21.10.6
Launching `/gpfs/home/liunyw/howler_monkey/make_lastz_chain/homSap-macFas/TEMP_run.lastz/lastz_homSapmacFas/script.nf` [peaceful_booth] - revision: 01f1105dc4

but when the program submits the task, it keeps giving errors

executor >  pbs (12)
[bb/ba7c51] process > execute_jobs (13) [  0%] 4 of 2446, failed: 4, retries: 4
[70/224df0] NOTE: Process `execute_jobs (26)` terminated with an error exit status (127) -- Execution is retried (1)
[a4/c418e8] NOTE: Process `execute_jobs (1)` terminated with an error exit status (127) -- Execution is retried (1)
[06/393f6f] NOTE: Process `execute_jobs (2)` terminated with an error exit status (127) -- Execution is retried (1)
[19/e38b9b] NOTE: Process `execute_jobs (8)` terminated with an error exit status (127) -- Execution is retried (1)

executor >  pbs (13)
[d1/24b0e5] process > execute_jobs (3)  [  0%] 10 of 2452, failed: 10, retries: 10
[70/224df0] NOTE: Process `execute_jobs (26)` terminated with an error exit status (127) -- Execution is retried (1)
[a4/c418e8] NOTE: Process `execute_jobs (1)` terminated with an error exit status (127) -- Execution is retried (1)
[06/393f6f] NOTE: Process `execute_jobs (2)` terminated with an error exit status (127) -- Execution is retried (1)
[19/e38b9b] NOTE: Process `execute_jobs (8)` terminated with an error exit status (127) -- Execution is retried (1)
[ae/c82b60] NOTE: Process `execute_jobs (10)` terminated with an error exit status (127) -- Execution is retried (1)
[a8/693d83] NOTE: Process `execute_jobs (4)` terminated with an error exit status (127) -- Execution is retried (1)
[88/aff74e] NOTE: Process `execute_jobs (15)` terminated with an error exit status (127) -- Execution is retried (1)
[4c/e9351d] NOTE: Process `execute_jobs (11)` terminated with an error exit status (127) -- Execution is retried (1)
[40/8918e1] NOTE: Process `execute_jobs (16)` terminated with an error exit status (127) -- Execution is retried (1)
[0e/991e02] NOTE: Process `execute_jobs (20)` terminated with an error exit status (127) -- Execution is retried (1)

Could you help me?

Thank you very much

Xiaolin

issue in the lastz step

Hi,

I use this pipeline to create genome alignment file with local mode. Here is my script
make_chains.py human macAss hg38.2bit macAss.2bit --project_dir human_macAss --executor_queuesize 40

It ran fine at first with no problems. However, I've noticed that I seem to be experiencing some problems, reporting errors as follows

executor >  local (1158)
[cc/0029ca] process > execute_jobs (1225) [ 59%] 1118 of 1875

executor >  local (1159)
[3f/96e7bf] process > execute_jobs (1081) [ 59%] 1119 of 1875

executor >  local (1160)
[cf/920417] process > execute_jobs (1082) [ 59%] 1119 of 1875

executor >  local (1160)
[bb/e4ed9a] process > execute_jobs (26)   [ 59%] 1120 of 1876, failed: 1, ret...
[bb/e4ed9a] NOTE: Process `execute_jobs (26)` failed -- Execution is retried (1)

executor >  local (1161)
[91/7d94e9] process > execute_jobs (1083) [ 59%] 1121 of 1876, failed: 1, ret...
[bb/e4ed9a] NOTE: Process `execute_jobs (26)` failed -- Execution is retried (1)

executor >  local (1161)
[91/7d94e9] process > execute_jobs (1083) [ 59%] 1121 of 1876, failed: 1, ret...
[bb/e4ed9a] NOTE: Process `execute_jobs (26)` failed -- Execution is retried (1)
[78/143ab2] process > execute_jobs (1085) [ 59%] 1123 of 1876, failed: 1, ret...


executor >  local (1163)
[78/143ab2] process > execute_jobs (1085) [ 59%] 1123 of 1876, failed: 1, ret...

This is supposed to happen in the lastz step, and I noticed that these errors were reported after the ************ HgStepManager: executing step 'lastz' Sun Oct 15 09:08:53 2023. hint.

And there are 4 errors happening now,

[35/91e12f] process > execute_jobs (1394) [ 72%] 1370 of 1879, failed: 4, ret...

executor >  local (1411)
[64/039d77] process > execute_jobs (1395) [ 72%] 1371 of 1879, failed: 4, ret...

executor >  local (1411)
[64/039d77] process > execute_jobs (1395) [ 72%] 1371 of 1879, failed: 4, ret...

executor >  local (1412)
[c3/9f78eb] process > execute_jobs (1396) [ 73%] 1372 of 1879, failed: 4, ret...

what should I do to fix this?

Thank you in advance for your help.

Yawen

Error in 'lastz'

Hi!

I'm trying to run make_chains.py for betta fish and human, but I had errors in the last several jobs (~90% completed) of step 'lastz'. The error is like this
NOTE: Process execute_jobs (3909) terminated with an error exit status (1) -- Execution is retried (3)

As you suggeset in other similar issues, I applied 4-round RepeatModeler and RepeatMasker for the betta fish genome and download the human genome from https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/ (Which is soft-masked). I also set the --seq1_chunk 50000000 --seq2_chunk 10000000 to reduce the chunk size. But the errors still occurs.

Could you help me figure out why? In addition, is there a way to rerun the failed jobs instead of running the lastz from the beginning?

Thanks!
Hongbing

Cant find the final chain file

I am trying to run this tool to get genome alignment chains (very useful, many thanks). It runs without any complains, and I can see in the log file that the program is completed. But, unfortunately I dont see the final chain file as output in the output directory I specify. Am I missing anything here ?

Here is what I am doing,

/Tools/make_lastz_chains/make_chains.py target_id query_id target.fa query.fa --project_dir Lastz_run1

It runs for a day and completed without any issue. I just dont see the output file. Any thoughts what may be wrong ?

V2 pipeline errors (cat_step; with my data NOT with test_data); V1 pipeline error:pipeline crashed

Hi @kirilenkobm @MichaelHiller,

I wget https://github.com/hillerlab/make_lastz_chains/archive/refs/heads/main.zip

installed all dependencies

Able to run make_chains.py successfully on test_data and got chained alignment BUT when I tried with my own data
after run of almost 16 hrs I got following error:

Lastz Alignment Step

LASTZ: making jobs
LASTZ: saved 968 jobs to /home/vlamba/make_genome-chaining-Feb3/test/temp_lastz_run/lastz_joblist.txt
Parallel manager: pushing job /share/apps/bioinformatics/nextflow/20.10.0/nextflow /scrfs/storage/vlamba/home/make_lastz_chains-main/parallelization/execute_joblist.nf --joblist /home/vlamba/make_genome-chaining-Feb3/test/temp_lastz_run/lastz_joblist.txt -c /home/vlamba/make_genome-chaining-Feb3/test/temp_lastz_run/lastz_config.nf

Nextflow process lastz finished successfully

Found 8 output files from the LASTZ step
Please note that lastz_step.py does not produce output in case LASTZ could not find any alignment

Concatenating Lastz Results (Cat) Step

Concatenating LASTZ output from 8 buckets

skip bucket bucket_ref_bulk_4: nothing to concat
skip bucket bucket_ref_bulk_2: nothing to concat
skip bucket bucket_ref_bulk_7: nothing to concat
skip bucket bucket_ref_bulk_5: nothing to concat
skip bucket bucket_ref_bulk_8: nothing to concat
skip bucket bucket_ref_bulk_1: nothing to concat
skip bucket bucket_ref_bulk_3: nothing to concat
skip bucket bucket_ref_bulk_6: nothing to concat
An error occurred while executing cat: Error! No non-empty files found at /home/vlamba/make_genome-chaining-Feb3/test/temp_concat_lastz_output. The failed operation label is: cat_step

Then I switch to V1 make_lastz_chains-1.0.0 and with test data I got following error:

[d1/4aa5ff] NOTE: Process execute_jobs (334) terminated with an error exit status (127) -- Execution is retried (2)
[b5/4f2f5d] NOTE: Process execute_jobs (332) terminated with an error exit status (127) -- Execution is retried (2)
[9c/4a16b7] NOTE: Process execute_jobs (625) terminated with an error exit status (127) -- Execution is retried (3)
[c7/851fa8] NOTE: Process execute_jobs (678) terminated with an error exit status (127) -- Execution is retried (2)
[ea/b72980] NOTE: Process execute_jobs (719) terminated with an error exit status (127) -- Execution is retried (2)
Error executing process > 'execute_jobs (67)'

Caused by:
Process execute_jobs (67) terminated with an error exit status (127)

Command executed:

/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/runRepeatFiller.sh /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/jobs/infillChain_158

Command exit status:
127

Command output:
..calling RepeatFiller:

Command error:
/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/runRepeatFiller.sh: line 12: --workdir: command not found

Work dir:
/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/fillChain_targetquery/work/21/08e1e7d44488e6d8c672adaff8864f

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

/home/vlamba/python3.14/lib/python3.9/site-packages/py_nf/py_nf.py:404: UserWarning: Nextflow pipeline fillChain_targetquery failed! Execute function returns 1.
warnings.warn(msg)
Uncaught exception from user code:
Command failed:
/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/doFillChain.sh
HgAutomate::run('/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/...') called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/HgRemoteScript.pm line 117
HgRemoteScript::execute('HgRemoteScript=HASH(0xc3cb78)') called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 735
main::doFillChains() called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/HgStepManager.pm line 169
HgStepManager::execute('HgStepManager=HASH(0xc39a18)') called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 877
Error!!! Output file /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/target.query.allfilled.chain.gz not found!
The pipeline crashed. Please contact developers by creating an issue at:
https://github.com/hillerlab/make_lastz_chains

I would really appreciate any help/suggestion.

Thank you so much for your time

Pipeline crash during partition

Hello,

I'm trying to create chain alignments for whole primate genomes (novel sequences) against the vervet reference genome. I removed the spaces and tabs from the fasta file while allowed the process to being but now the pipeline is crashing at doPartition.bash

dwnloads/make_lastz_chains/make_chains.py --pd chain_files --force_def --DEF chain_files/DEF sabaeus neglectus chlSab1/vervet_nospace.fa GuenH/assembly.megabubbles.ns.fasta

Here's the log file: do partition log file.pdf

Please let me know what suggestions you have to fix this error.

Is there any parameter to set the maximum number of threads used?

Hi, is there any parameter to set the maximum number of threads used?
For example limiting the maximum number of threads used to 40

Partial Lastz Process Timeouts in make_lastz_chains

Hi,

I've been utilizing the make_lastz_chains tool with the following command:

/public/home/zhaohang/soft/make_lastz_chains-2.0.7/make_chains.py target query ${refgenome_softmasked} ${quegenome_softmasked} --pd chain_out -f --chaining_memory 100 --cluster_executor slurm --cluster_queue smp01,amd,low --keep_temp
I'm encountering an issue where some of the lastz subprocesses are timing out and failing to complete.

Could this be due to the large size of the genome or the high proportion of repetitive sequences? What would be the recommended approach to mitigate this issue? Should I reduce the chunk size or extend the runtime of the subprocesses, and if so, could you provide guidance on how to adjust these parameters effectively?

RepeatFiller

HgAutomate::run("TEMP_run.lastz"...) called at make_lastz_chains/doLastzChains/HgRemoteScript.pm line 117

HgRemoteScript::execute(HgRemoteScript=HASH(0x557801b9a758)) called at make_lastz_chains/doLastzChains/doLastzChain.pl line 370

main::doPartition() called at make_lastz_chains/doLastzChains/HgStepManager.pm line 169

HgStepManager::execute(HgStepManager=HASH(0x557801b45280)) called at make_lastz_chains/doLastzChains/doLastzChain.pl line 877

lastz options to obtain the "best path" alignments?

Hello again,
With your help, I was able to get chain alignments for human-mouse, human-zebrafish, etc.

I used the NCBI soft-masked genomes, and it seems like the repeat-masking was indeed not enough. I can see blocks of many dense and short diagonal lines, i.e. alignments from repeats that escaped the masking. I am re-doing the alignments using UCSC soft-masked (RepeatMasker + TRF?) genomes, plus genomes masked with RepeatModeler + RepeatMasker (as @MichaelHiller suggested).

We are interested in identifying the "best path" alignments, where the two genomes could be more or less 1:1 covered as continuously as possible. (This could become 1:many or many:many if either of both genomes has whole genome duplications, but I would like to think about it later...).

So here are a few questions:

Reading the lastz manual, it seems like there are options (e.g. --gfextend --chain --gapped in Fig 2f ) and steps (chaining, interpolation by --inner=, etc.) that could help longer and "cleaner" alignments. I wonder whether these steps and options were already applied. For example, is FILL_CHAIN=1 essentially performing interpolation?
What does CLEANCHAIN=1 do? Will it try to reduce the number of e.g. non-reciprocal chains?
Other than the more extensive repeat-masking, would there be any other suggestions to obtain the "best" or better paths? Would modifying the H, Y, L, and K parameters be helpful in any way?

Thanks again!
Dong-Ha

NOTE: Process `execute_jobs (605)` terminated with an error exit status (1) -- Execution is retried (1)

Hi !
I get this error if I use 'slurm' to submit the job,
but I don't get it if I run the command directly on another server.

The programme was running normally for some time before this error occurred

The version of my nextflow is "version 23.10.0.5889",
I am trying to downgrade nextflow below 22.12.0, as mentioned in "#18".

The following is a portion of the run log file
“
executor > local (627)
[77/314080] process > execute_jobs (618) [ 25%] 620 of 2414

executor > local (627)
[77/314080] process > execute_jobs (618) [ 25%] 621 of 2414

executor > local (628)
[33/b87af4] process > execute_jobs (620) [ 25%] 621 of 2414

executor > local (629)
[a0/766938] process > execute_jobs (605) [ 25%] 622 of 2415, failed: 1, retri...
[a0/766938] NOTE: Process execute_jobs (605) terminated with an error exit status (1) -- Execution is retried (1)

executor > local (630)
[37/c0e27f] process > execute_jobs (622) [ 25%] 623 of 2416, failed: 2, retri...
[5f/02fc3c] NOTE: Process execute_jobs (606) terminated with an error exit status (1) -- Execution is retried (1)

executor > local (631)
[8f/b9b0ef] process > execute_jobs (624) [ 25%] 624 of 2416, failed: 2, retri...

executor > local (632)
[67/ff4d54] process > execute_jobs (623) [ 25%] 625 of 2416, failed: 2, retri...

”

NotImplementedError: Executor pbs is not supported, abort

Hi @kirilenkobm,

I've been trying to run make_lastz_chains on my dataset and I ran the following command:
./make_chains.py homSap gorGor ./raw-genome/homSap.sm.fa ./raw-genome/gorGor.sm.fa --executor pbs --executor_queuesize 100 --project_dir homSap-gorGor

However I have encountered an error message.

************ HgStepManager: executing step 'lastz' Sat May  7 20:48:40 2022.
doLastzClusterRun ....
testCleanState current step: doLastzClusterRun, /gpfs/home/liunyw/howler_monkey/gene_loss/homSap-gorGor/TEMP_run.lastz, /gpfs/home/liunyw/howler_monkey/gene_loss/homSap-gorGor/TEMP_run.lastz/lastz.done     previous step: doPartition, /gpfs/home/liunyw/howler_monkey/gene_loss/homSap-gorGor/TEMP_run.lastz/partition.done
# chmod a+x /gpfs/home/liunyw/howler_monkey/gene_loss/homSap-gorGor/TEMP_run.lastz/doClusterRun.sh
# /gpfs/home/liunyw/howler_monkey/gene_loss/homSap-gorGor/TEMP_run.lastz/doClusterRun.sh
+ cd /gpfs/home/liunyw/howler_monkey/gene_loss/homSap-gorGor/TEMP_run.lastz
+ gensub2 homSap.lst gorGor.lst gsub jobList
+ parallel_executor.py lastz_homSapgorGor jobList -q day --memoryMb 10000 -e pbs --co None -p batch --eq 100
Traceback (most recent call last):
  File "/gpfs/home/liunyw/biosoft/make_lastz_chains/HL_scripts/parallel_executor.py", line 176, in <module>
    main()
  File "/gpfs/home/liunyw/biosoft/make_lastz_chains/HL_scripts/parallel_executor.py", line 155, in main
    queue=_q_arg,
  File "/gpfs/home/liunyw/soft/python3/lib/python3.6/site-packages/py_nf/py_nf.py", line 125, in __init__
    self.__check_executor()
  File "/gpfs/home/liunyw/soft/python3/lib/python3.6/site-packages/py_nf/py_nf.py", line 216, in __check_executor
    raise NotImplementedError(msg)
NotImplementedError: Executor pbs is not supported, abort

This looks like the --executor parameter is not set properly. However, what confuses me is that the Executors of cluster I am using is pbs and I have used process.executor = 'pbs' in my TOGA test data(nextflow_config_files/call_cesar_config_template.nf) and it worked.

Thank you
Yawen

Stalled at the doChainRun step

Hi, Bogdan and Michael @kirilenkobm @MichaelHiller,
This was originally a comment on issue #9. Then, I realize it's a separate issue.

When I tried human vs. Chimpanzee and human vs. Macaca mulatta, the runs stalled at the doChainRun step. All runs started to fail after trying a couple of the 25 doChainRun.csh jobs (which, to my understanding, = axtChain | chainAntiRepeat to convert 25 psl chunks to 25 chained axt chunks). It started to fail, retry, and so on, and did not appear to move.

Please see this log file for an example of the failed runs: 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0__failed.log

Could you suggest a parameter I can change to make the doChainRun step work?

More specifically:

Will increasing the --chaining_memory (default 50000) help doChainRun step?
What is the meaning of --chaining_memory parameter? Is it total RAM (in MB?) assigned to each of the 25 jobs? Can I use the total RAM allowed for the entire node (e.g. 192000 for 192GB)?
Is there a way to make more than 25 chunks of psl files if this happens because each chunk is too big?

Thanks!
Dong-Ha

alignment is taking too long

Could you please tell if something is wrong with this command as it is running even after 4 days. It started with running in a cluster occupying a total of around 500 single core jobs but around 125 jobs are still running.

make_chains.py --project_dir OUTPUT --executor slurm --cluster_parameters '-A AB-123456' --executor_partition defq --force_def chicken bird chicken.2bit bird.2bit

The issue with the temporary file generation path.

Hello,

when running make_chains.py on two 700MB genomes, the temporary files were stored in /dev/mapper/cl-root, which caused the storage of /dev/mapper/cl-root to be filled up and eventually resulted in the termination of make_chains.py. Could you please advise on how to modify the temporary file generation path for make_chains.py? Thank you very much for your help.

##Centos7

An error occurred while executing cat: Error! No non-empty files found at /home/vlamba/CotDis_chaining/temp_concat_lastz_output. The failed operation label is: cat_step

Hi authors,

I ran the ./make_chains.py Cot Dis /home/vlamba/target.fa /storage/vlamba/data/Genomes-noto/NOTO-genome/query.fa --project_dir /home/vlamba/CotDis_chaining -f --chaining_memory 30 and got the following error:

`### Nextflow process lastz finished successfully
Found 8 output files from the LASTZ step
Please note that lastz_step.py does not produce output in case LASTZ could not find any alignment

Concatenating Lastz Results (Cat) Step

Concatenating LASTZ output from 8 buckets

skip bucket bucket_ref_bulk_4: nothing to concat
skip bucket bucket_ref_bulk_2: nothing to concat
skip bucket bucket_ref_bulk_7: nothing to concat
skip bucket bucket_ref_bulk_5: nothing to concat
skip bucket bucket_ref_bulk_8: nothing to concat
skip bucket bucket_ref_bulk_1: nothing to concat
skip bucket bucket_ref_bulk_3: nothing to concat
skip bucket bucket_ref_bulk_6: nothing to concat
An error occurred while executing cat: Error! No non-empty files found at /home/vlamba/CotDis_chaining/temp_concat_lastz_output. The failed operation label is: cat_step`

Could you please suggest If I did something wrong?

Thank you

Allow custom process.time

Some doLastzClusterRun jobs exceed the hardcoded process.time of two days.

make_lastz_chains/constants.py

Line 110 in 187e313

JOB_TIME_REQ = '48h'

The default behavior for these jobs seems to be to rerun. However, the result appears to be an infinitely rerunning failing job. (Maybe there is an upper limit to the number of reruns, but for two day--long jobs I haven't seen this reach an upper limit).

An easy solution is to allow customizable JOB_TIME_REQ by setting 48h as the default and allowing user overrides by command line option.

I suppose it is also possible to split the genomes into smaller chunks, but I do not know enough about genome-genome alignments to say if that might affect the resulting alignments. If it might affect alignment results, I would prefer increasing JOB_TIME_REQ.

This issue is related to #43 and perhaps #48.

Personally, a very long JOB_TIME_REQ is fine for me, since I run the Nextflow process with local executor within a single large cluster job.

Uncaught exception from user code: Command failed.

Dear Developers,

Than you very much for your software.
I tried to make chains file and the following message occurred:

/home/popov/make_lastz_chains/make_chains.py hg38 my_gen /home/popov/Human_genome_38/GRCh38.primary_assembly.genome.ordered.fasta /home/popov/primary.genome.scf.fasta.scaffolds.fa --project_dir /home/popov/chains_first_attempt --executor slurm --executor_partition genetics
Project directory: /home/popov/chains_first_attempt
Target path: /home/popov/chains_first_attempt/hg38.2bit | chrom sizes: /home/popov/chains_first_attempt/hg38.chrom.sizes
Query path: /home/popov/chains_first_attempt/my_gen.2bit | query sizes: /home/popov/chains_first_attempt/my_gen.chrom.sizes
Calling: /home/popov/chains_first_attempt/master_script.sh...
BIN: /home/popov/make_lastz_chains/doLastzChains
which: no RepeatFiller.py in (/home/popov/make_lastz_chains/GenomeAlignmentTools/src:/home/popov/make_lastz_chains/kent_binaries:/home/popov/make_lastz_chains/HL_kent_binaries:/home/popov/make_lastz_chains/HL_scripts:/home/popov/make_lastz_chains/doLastzChains:/home/popov/miniconda3/envs/togaenv/bin:/opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.3/bin:/opt/ohpc/pub/compiler/gcc/8.3.0/bin:/opt/ohpc/pub/utils/prun/1.3:/opt/ohpc/pub/utils/autotools/bin:/opt/ohpc/pub/bin:/usr/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/popov/.local/bin:/home/popov/bin)
PARAMETERS:
clusterRunDir /home/popov/chains_first_attempt
chainMinScore 1000
chainLinearGap loose
maxNumLastzJobs 6000
numFillJobs 1000
verbose 1
debug 0
/home/popov/chains_first_attempt/DEF looks OK!
tDb=hg38
qDb=my_gen
seq1dir=/home/popov/chains_first_attempt/hg38.2bit
seq2dir=/home/popov/chains_first_attempt/my_gen.2bit
Will run RepeatFiller.
Will run chainCleaner with parameters: -LRfoldThreshold=2.5 -doPairs -LRfoldThresholdPairs=10 -maxPairDistance=10000 -maxSuspectScore=100000 -minBrokenChainScore=75000
max number of lastz cluster jobs: 6000
The /home/popov/make_lastz_chains/doLastzChains/doLastzChain.pl run will be performed in directory /home/popov/chains_first_attempt. All temp files will be written there.
HgStepManager: executing from step 'partition' through step 'cleanChains'.

************ HgStepManager: executing step 'partition' Thu Sep 7 17:58:05 2023.
doPartition ....
doPartition: seq2MaxLength ....
doPartition: seq2MaxLength = 57195878
doPartition: partitionTargetCmd partitionSequence.pl 175000000 0 /home/popov/chains_first_attempt/hg38.2bit /home/popov/chains_first_attempt/hg38.chrom.sizes -xdir xdir.sh -rawDir /home/popov/chains_first_attempt/TEMP_psl 4000 -lstDir tParts > hg38.lst
doPartition: partitionQueryCmd partitionSequence.pl 50000000 10000 /home/popov/chains_first_attempt/my_gen.2bit /home/popov/chains_first_attempt/my_gen.chrom.sizes 10000 -lstDir qParts > my_gen.lst
content of /home/popov/chains_first_attempt/TEMP_run.lastz/doPartition.bash
#chmod a+x /home/popov/chains_first_attempt/TEMP_run.lastz/doPartition.bash
#/home/popov/chains_first_attempt/TEMP_run.lastz/doPartition.bash

cd /home/popov/chains_first_attempt/TEMP_run.lastz
partitionSequence.pl 175000000 0 /home/popov/chains_first_attempt/hg38.2bit /home/popov/chains_first_attempt/hg38.chrom.sizes -xdir xdir.sh -rawDir /home/popov/chains_first_attempt/TEMP_psl 4000 -lstDir tParts
++ wc -l
export L1=25
L1=25
partitionSequence.pl 50000000 10000 /home/popov/chains_first_attempt/my_gen.2bit /home/popov/chains_first_attempt/my_gen.chrom.sizes 10000 -lstDir qParts
++ wc -l
export L2=43
L2=43
++ echo 25 43
++ awk '{print $1*$2}'
export L=1075
L=1075
echo 'cluster batch jobList size: 1075 = 25 * 43'
cluster batch jobList size: 1075 = 25 * 43
'[' -d tParts ']'
echo 'constructing tParts/.2bit files'
constructing tParts/.2bit files
sed -e 's#tParts/##; s#.lst##;'
read tPart
ls tParts/part000.lst tParts/part001.lst tParts/part002.lst tParts/part003.lst tParts/part004.lst tParts/part005.lst tParts/part006.lst tParts/part007.lst tParts/part008.lst tParts/part009.lst tParts/part010.lst tParts/part011.lst tParts/part012.lst tParts/part013.lst tParts/part014.lst
sed -e 's#.*.2bit:##;' tParts/part000.lst
twoBitToFa -seqList=stdin /home/popov/chains_first_attempt/hg38.2bit stdout
faToTwoBit stdin tParts/part000.2bit
/home/popov/chains_first_attempt/TEMP_run.lastz/doPartition.bash: line 22: /home/popov/make_lastz_chains/kent_binaries/twoBitToFa: Permission denied
Uncaught exception from user code:
Command failed:
/home/popov/chains_first_attempt/TEMP_run.lastz/doPartition.bash
HgAutomate::run('/home/popov/chains_first_attempt/TEMP_run.lastz/doPartition.bash') called at /home/popov/make_lastz_chains/doLastzChains/HgRemoteScript.pm line 117
HgRemoteScript::execute('HgRemoteScript=HASH(0x2196128)') called at /home/popov/make_lastz_chains/doLastzChains/doLastzChain.pl line 370
main::doPartition() called at /home/popov/make_lastz_chains/doLastzChains/HgStepManager.pm line 169
HgStepManager::execute('HgStepManager=HASH(0x2195678)') called at /home/popov/make_lastz_chains/doLastzChains/doLastzChain.pl line 877
Error!!! Output file /home/popov/chains_first_attempt/hg38.my_gen.allfilled.chain.gz not found!
The pipeline crashed. Please contact developers by creating an issue at:
https://github.com/hillerlab/make_lastz_chains

Can you please tell what could be the reason of this problem and how to solve it?

TODO? QC/statistics module

Idea for the future: add a module that accumulates detailed statistics for each pipeline step.
Like, as how many chains we get for each chromosome after axtChain, how many blocks of what length, how many new blocks we get at the fill chain step, and other information that may be relevant for QC.

Full scale run creates too many lastz jobs

If the reference has many short chromosomes/scaffolds, it results in a tremendous amount of last jobs (hundreds of thousands).
The partitioning step must handle such cases.
For example, by merging such chromosomes/scaffolds into "buckets", processed as a single unit by run_lastz.py

Resuming a stalled run.

Hi again,

We have been running make_lastz_chains for combinations of species pairs, masking options, and LastZ parameter sets. The pipeline usually runs smooth, except a couple of times when the comparison was heavy (larger and more closely related genomes).

On one occasion, the primary lastz alignment step (=total 1,825 jobs) was finished, but in a later step (doChainRun), the pipeline was stalled, failing and retrying 25 jobs. Here is the command used (run on a node with 32-core, with 4GB RAM per core):

echo -e "FILL_CHAIN=0" > DEF_noFillChain # we were skipping RepeatFiller for this run
make_chains.py h38w5 cPTRv2w5 h38_primaryAssembly.WMt98.0.fa cPTRv2.WMt98.0.fa \
   --DEF DEF_noFillChain --executor_queuesize 32 \
   --project_dir 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0 \
   > 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0.log 2>&1

And the log file: 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0__failed.log

We can see the following steps were done:

$ ls -ltrh 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_*/*.done
-rw-r--r-- 1 ohd3 contig 0 Oct 10 09:36 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_run.lastz/partition.done
-rw-r--r-- 1 ohd3 contig 0 Oct 11 23:16 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_run.lastz/lastz.done
-rw-r--r-- 1 ohd3 contig 0 Oct 11 23:31 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_run.cat/cat.done

Hence, we wanted to resume the run on a node with more RAM (8GB per core). We first tried with --resume and received this complaint (path simplified):

Confusion: ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/DEF already exists
Please set --force_def to override

So we tried again with the following:

echo -e "FILL_CHAIN=0" > DEF_noFillChain # we were skipping RepeatFiller for this run
make_chains.py h38w5 cPTRv2w5 h38_primaryAssembly.WMt98.0.fa cPTRv2.WMt98.0.fa \
   --resume --force_def \
   --DEF DEF_noFillChain --executor_queuesize 32 \
   --project_dir 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0 \
   > 221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0_resume.log 2>&1

Now the complaint included this line (paths simplified):

doPartition: looks like doPartition was already successful (./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_run.lastz/partition.done exists).
Either -continue {some later step} or move aside/remove ./221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0/TEMP_run.lastz/ and run again.

Questions:

Since removing the TEMP_run.lastz folder may make the pipeline run the 1,825 lastz jobs again, we would like to try -continue {some later step}. What should we use as the {some later step}? Will it be doChainRun? Is there a list of steps where we can resume the pipeline?
Is there any other parameters we can set when resuming? Will increasing --chaining_memory CHAINING_MEMORY to, say, 100000 or 200000 help? The default here seems to be 50000 (MB? per core?).

Thanks a lot!
Dong-Ha

P.S.
In case needed, here are the full resume log files:
221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0_resume__1stAttempt.log
221009_LastZ_chain_h38w5cPTRv2w5_mlcV2F0_resume__2ndAttempt.log

Stats for chain + psl (or axt) intermediate files?

Hello again,
I am now exploring the first chain output I obtained from a successful run :)

I am curious how we can get basic stats for the alignments:

What portions of genomes were covered by the primary alignment (.psl or .axt) and the chains (.chain)?
Were some portions of genomes covered multiple times (if so, how much and how many times)?
What are the length distributions of all primary and chained alignments?
Distributions of distances (e.g. substitutions per site or any other measure of similarities) among alignments?

I can extract and process the headers of the chain file to answer some of the above questions, but I wonder if there are already tools to do so.

Plus, especially for item 4, we may need the .psl (or .axt?) alignment since .chain appears to have dropped the information on the similarity scores of alignments. Is there a way to retrieve a concatenated .psl alignment file?

Here are the files and folders in the output directory after the run:

$ ls -1tr 220725_LastZ_chain_DmeDsi_2ndAttempt
Dme.2bit
Dme.chrom.sizes
Dsi.2bit
Dsi.chrom.sizes
DEF
make_chains_py_params.json
master_script.sh
TEMP_psl
TEMP_run.lastz
TEMP_pslParts
TEMP_run.cat
TEMP_run.fillChain
Dme.Dsi.allfilled.chain.gz
TEMP_axtChain
cleanUp.csh

Thanks!

requirements

I think there might be some missing requirements

faToTwoBit
twoBitToFa
gensub2
Nextflow: version beyond 22.10.x are not anymore supporting DSL1

Argument list too long - lastz

Hi I am trying to chain the chicken and turkey genomes but I am getting this error :

/nas5/aluzuriaganeira/birds/TOGA/make_lastz_chains/chicken_turkey/temp_lastz_run/work/33/98c1845c4ef249c5f4bae4eb8e5517/.command.sh: line 2: /nas5/aluzuriaganeira/birds/TOGA/make_lastz_chains/standalone_scripts/run_lastz_intermediate_layer.py: Argument list too long

Any ideas on how to fix it, I am running it in a cluster with 50 cpus and 80GB of memory so I don't think is a problem of resources.

Alignment options for distant species,

Hi, Bogdan and Michael @kirilenkobm @MichaelHiller,

Thanks again for developing this pipeline and also for responding to our requests. We have successfully used the pipeline for species pairs with MASH distances ranging from 0.1 to 0.27, with the default parameters.

In the previous ticket #10, the issue was runtime and RAM usage for close species (e.g., MASH distance <0.1, such as human vs. primates). How about more distant species pairs, e.g., human vs. zebrafish with MASH distance 0.29?

The default make_lastz_chains parameters are already quite sensitive. But I'd like to know if there is a way to tweak it even more for distant species, aiming to retrieve as much synteny information for orthologs as possible.

UCSC has these example parameter sets:

chainNear="-minScore=5000 -linearGap=medium"
chainMedium="-minScore=3000 -linearGap=medium"
chainFar="-minScore=5000 -linearGap=loose"
lastzNear="C=0 E=150 H=0 K=4500 L=3000 M=254 O=600 Q=/scratch/data/blastz/human_chimp.v2.q T=2 Y=15000"
lastzMedium="C=0 E=30 H=0 K=3000 L=3000 M=50 O=400 T=1 Y=9400"
lastzFar="C=0 E=30 H=2000 K=2200 L=6000 M=50 O=400 Q=/scratch/data/blastz/HoxD55.q T=2 Y=3400"

make_lastz_chains can accept different K, L, H, and Y options, and as Michael pointed out in an email with us, K and L might be the most important. I will try with K=2200 and L=6000 ( make_lastz_chains default: K=2400 and L=3000)

Could you also add options to control the following:

the "-minScore" for the chaining step (I think make_lastz_chains already uses -linearGap=loose as default).
the Q parameter, i.e., the matrix file, so that I can test if using HoxD55.q would make a difference.

Plus, I would appreciate any other suggestions for distant species, with the aim of retrieving as much synteny information for orthologs as possible.

Thanks!
Dong-Ha

The pipeline crashed Error

Hi Authors

I tried the V1 of this pipeline but got the following error.

[d1/4aa5ff] NOTE: Process execute_jobs (334) terminated with an error exit status (127) -- Execution is retried (2)
[b5/4f2f5d] NOTE: Process execute_jobs (332) terminated with an error exit status (127) -- Execution is retried (2)
[9c/4a16b7] NOTE: Process execute_jobs (625) terminated with an error exit status (127) -- Execution is retried (3)
[c7/851fa8] NOTE: Process execute_jobs (678) terminated with an error exit status (127) -- Execution is retried (2)
[ea/b72980] NOTE: Process execute_jobs (719) terminated with an error exit status (127) -- Execution is retried (2)
Error executing process > 'execute_jobs (67)'

Caused by:
Process execute_jobs (67) terminated with an error exit status (127)

Command executed:

/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/runRepeatFiller.sh /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/jobs/infillChain_158

Command exit status:
127

Command output:
..calling RepeatFiller:

Command error:
/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/runRepeatFiller.sh: line 12: --workdir: command not found

Work dir:
/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/fillChain_targetquery/work/21/08e1e7d44488e6d8c672adaff8864f

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

self chains

Hi
I want to do chainself. Can this pipeline do chainself?

NOTE: Process `execute_jobs (64)` terminated with an error exit status (2) -- Execution is retried (1)

Could you please help with?

N E X T F L O W  ~  version 21.10.1
Launching `XXXXX/TEMP_run.lastz/lastz_chickenNB/script.nf` [maniac_yonath] - revision: 20fae9dc0e
[-        ] process > execute_jobs -

executor >  local (17)
[37/e70339] process > execute_jobs (4) [  0%] 0 of 190

executor >  local (49)
[b9/89be59] process > execute_jobs (19) [  0%] 0 of 225

executor >  local (50)
[2a/0acf5d] process > execute_jobs (46) [  0%] 1 of 226, failed: 1, retries: 1
[2a/0acf5d] NOTE: Process `execute_jobs (46)` terminated with an error exit status (2) -- Execution is retried (1)

executor >  local (51)
[3c/cbb0c0] process > execute_jobs (51) [  0%] 2 of 227, failed: 2, retries: 2
[d2/aff088] NOTE: Process `execute_jobs (64)` terminated with an error exit status (2) -- Execution is retried (1)

nextflow

Hi
I met a error about nextflow. I used nextflow version 20 in ubuntu. Could you help me?
N E X T F L O W ~ version 20.10.0
Launching /software/make_lastz_chains/Tele2Ppup/TEMP_run.lastz/lastz_PpupTele/script.nf [romantic_albattani] - revision: d48f32974e
/tmp/nxf-13131159217981153035
/software/miniconda3/envs/NF/lib/python3.8/site-packages/py_nf/py_nf.py:404: UserWarning: Nextflow pipeline lastz_PpupTele failed! Execute function returns 1.

The pipeline does not work with 2bit files as input

The pipeline is planned to work with both fasta and twoBit formats as input.
However, the 2.0.5 could not handle 2bit input.
To be fixed in the 2.0.6

Issue Running make_lastz_chains-2.0.6 on Test Data

Hello,

I encountered an issue when running the test data with the following command:
make_chains.py target query test_data/test_reference.fa test_data/test_query.fa --pd test_out -f --chaining_memory 16

The error I received is as follows:

### Nextflow process chain_run finished successfully
An error occurred while executing chain_run: Error! No non-empty files found at /home/zhaohang/soft/make_lastz_chains-2.0.6/test_out/temp_chain_run/chain. The failed operation label is: chain_run

Traceback (most recent call last):
  File "/home/zhaohang/soft/make_lastz_chains-2.0.6/modules/step_manager.py", line 70, in execute_steps
    step_result = step_to_function[step](params, project_paths, step_executables)
  File "/home/zhaohang/soft/make_lastz_chains-2.0.6/modules/pipeline_steps.py", line 64, in chain_run_step
    do_chain_run(params, project_paths, executables)
  File "/home/zhaohang/soft/make_lastz_chains-2.0.6/steps_implementations/chain_run_step.py", line 113, in do_chain_run
    has_non_empty_file(project_paths.chain_output_dir, "chain_run")
  File "/home/zhaohang/soft/make_lastz_chains-2.0.6/modules/common.py", line 51, in has_non_empty_file
    raise PipelineFileNotFoundError(err_msg)
modules.error_classes.PipelineFileNotFoundError: Error! No non-empty files found at /home/zhaohang/soft/make_lastz_chains-2.0.6/test_out/temp_chain_run/chain. The failed operation label is: chain_run

Any assistance on this would be greatly appreciated.

Thank you!

Stalled at doPartition

Hi!

I was running the command: make_chains.py MALLARD QUERY_GENOME GCF_015476345.1_ZJU1.0_genomic.fna GCA_907165065.1_bCapEur3.1_genomic.fna --project_dir /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME --force_def --chaining_memory 200000 --seq1_chunk 50000000 --seq2_chunk 10000000

With error output:

Target path: /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/MALLARD.2bit | chrom sizes: /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/MALLARD.chrom.sizes
Query path: /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/QUERY_GENOME.2bit | query sizes: /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/QUERY_GENOME.chrom.sizes
Calling: /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/master_script.sh...
BIN: /beegfs/store4/chenyangkang/software/make_lastz_chains/doLastzChains
PARAMETERS:
	clusterRunDir /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME
	chainMinScore 1000
	chainLinearGap loose
	maxNumLastzJobs 6000
	numFillJobs 1000
	verbose 1
	debug 0
/beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/DEF looks OK!
	tDb=MALLARD
	qDb=QUERY_GENOME
	seq1dir=/beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/MALLARD.2bit
	seq2dir=/beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/QUERY_GENOME.2bit
Will run RepeatFiller. 
Will run chainCleaner with parameters: -LRfoldThreshold=2.5 -doPairs -LRfoldThresholdPairs=10 -maxPairDistance=10000 -maxSuspectScore=100000 -minBrokenChainScore=75000
max number of lastz cluster jobs: 6000
The /beegfs/store4/chenyangkang/software/make_lastz_chains/doLastzChains/doLastzChain.pl run will be performed in directory /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME. All temp files will be written there.
HgStepManager: executing from step 'partition' through step 'cleanChains'.

************ HgStepManager: executing step 'partition' Mon Jun  5 19:59:39 2023.
doPartition ....
doPartition: seq2MaxLength ....
doPartition: seq2MaxLength = 126318510
doPartition: partitionTargetCmd partitionSequence.pl 50000000 0 /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/MALLARD.2bit /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/MALLARD.chrom.sizes -xdir xdir.sh -rawDir /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/TEMP_psl 4000 -lstDir tParts > MALLARD.lst
doPartition: partitionQueryCmd partitionSequence.pl 10000000 10000 /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/QUERY_GENOME.2bit /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/QUERY_GENOME.chrom.sizes 10000 -lstDir qParts > QUERY_GENOME.lst
content of /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/TEMP_run.lastz/doPartition.bash
# chmod a+x /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/TEMP_run.lastz/doPartition.bash
# /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/TEMP_run.lastz/doPartition.bash
+ cd /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/TEMP_run.lastz
+ partitionSequence.pl 50000000 0 /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/MALLARD.2bit /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/MALLARD.chrom.sizes -xdir xdir.sh -rawDir /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/TEMP_psl 4000 -lstDir tParts
lstDir tParts must be empty, but seems to have files  (part001.lst ...)
Uncaught exception from user code:
	Command failed:
	/beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/TEMP_run.lastz/doPartition.bash
	HgAutomate::run('/beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_projec...') called at /beegfs/store4/chenyangkang/software/make_lastz_chains/doLastzChains/HgRemoteScript.pm line 117
	HgRemoteScript::execute('HgRemoteScript=HASH(0x20cb858)') called at /beegfs/store4/chenyangkang/software/make_lastz_chains/doLastzChains/doLastzChain.pl line 370
	main::doPartition() called at /beegfs/store4/chenyangkang/software/make_lastz_chains/doLastzChains/HgStepManager.pm line 169
	HgStepManager::execute('HgStepManager=HASH(0x20c8e20)') called at /beegfs/store4/chenyangkang/software/make_lastz_chains/doLastzChains/doLastzChain.pl line 877
Error!!! Output file /beegfs/store4/chenyangkang/06.ebird_data/43.Phenology_project_files/data/chains/QUERY_GENOME/MALLARD.QUERY_GENOME.allfilled.chain.gz not found!
The pipeline crashed. Please contact developers by creating an issue at:
https://github.com/hillerlab/make_lastz_chains

The header of the sequence is quite simple:

>BCE_CHR_109
>BCE_CHR_110
>BCE_CHR_111

>ZJU_CHR_748
>ZJU_CHR_749
>ZJU_CHR_750

Any hint? Thanks!

Yangkang

Fatal non-specific error if genome names are the same as genome filenames

If the target_name/query_name has the same value as the basename of target_genome/query_genome, then the pipeline fails with the "no non-empty files" non-specific error:

### Concatenating Lastz Results (Cat) Step ###

Concatenating LASTZ output from 1 buckets
* skip bucket bucket_ref__chrX_in_0_156040895: nothing to concat
An error occurred while executing cat: Error! No non-empty files found at ~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/temp_concat_lastz_output. The failed operation label is: cat_step
Traceback (most recent call last):
  File "~/repos/toga-pipeline/inter/100-install/make_lastz_chains/modules/step_manager.py", line 70, in execute_steps
    step_result = step_to_function[step](params, project_paths, step_executables)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/repos/toga-pipeline/inter/100-install/make_lastz_chains/modules/pipeline_steps.py", line 58, in cat_step
    do_cat(params, project_paths, executables)
  File "~/repos/toga-pipeline/inter/100-install/make_lastz_chains/steps_implementations/cat_step.py", line 51, in do_cat
    has_non_empty_file(project_paths.cat_out_dirname, "cat_step")
  File "~/repos/toga-pipeline/inter/100-install/make_lastz_chains/modules/common.py", line 51, in has_non_empty_file
    raise PipelineFileNotFoundError(err_msg)
modules.error_classes.PipelineFileNotFoundError: Error! No non-empty files found at ~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/temp_concat_lastz_output. The failed operation label is: cat_step

In fact, the initial error seems to be from Lastz. Tracing the chains_joblist commands, I see the command,

~/repos/toga-pipeline/inter/100-install/make_lastz_chains/standalone_scripts/run_lastz_intermediate_layer.py ~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/target.2bit:chrX:0-156040895 ~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/query.2bit:chrX:0-50000000 ~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/pipeline_parameters.json ~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/temp_lastz_psl_output/bucket_ref__chrX_in_0_156040895/chrX_chrX__1.psl ~/repos/toga-pipeline/inter/100-install/make_lastz_chains/standalone_scripts/run_lastz.py --output_format psl --axt_to_psl ~/repos/toga-pipeline/.snakemake/conda/d1516d7b42018368bb384d7b98ada0c3_/bin/axtToPsl

Which, when run, gives a traceback point to Lastz:

Traceback (most recent call last):
  File "~/repos/toga-pipeline/inter/100-install/make_lastz_chains/standalone_scripts/run_lastz.py", line 197, in call_lastz
    lastz_out = subprocess.check_output(cmd, shell=True, stderr=subprocess.PIPE).decode("utf-8")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/repos/toga-pipeline/.snakemake/conda/d1516d7b42018368bb384d7b98ada0c3_/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/repos/toga-pipeline/.snakemake/conda/d1516d7b42018368bb384d7b98ada0c3_/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'lastz "~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/target.2bit/chrX[1,156040895][multiple]" "~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/query.2bit/chrX[1,50000000][multiple]" Y=9400 H=2000 L=3000 K=2400 --traceback=800.0M --format=axt+' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/repos/toga-pipeline/inter/100-install/make_lastz_chains/standalone_scripts/run_lastz.py", line 336, in <module>
    main()
  File "~/repos/toga-pipeline/inter/100-install/make_lastz_chains/standalone_scripts/run_lastz.py", line 317, in main
    lastz_output = call_lastz(cmd)
                   ^^^^^^^^^^^^^^^
  File "~/repos/toga-pipeline/inter/100-install/make_lastz_chains/standalone_scripts/run_lastz.py", line 201, in call_lastz
    raise LastzProcessError(
LastzProcessError: Lastz command failed with exit code 1. Error message: FAILURE: fopen_or_die failed to open "~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/target.2bit/chrX" for "rb"

When I checked ~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/target.2bit/chrX, I found that ~/repos/toga-pipeline/inter/300-chain/hg38-chrX.2bit/mm10-chrX.2bit/out/target.2bit is a soft-link to a file, which is probably why Lastz failed, because it thinks that is a directory (?).

When I changed the target_name/query_name such that it was different from the basename of the target_genome/query_genome files (and deleted all the intermediate files), this error resolved. I regret that I do not have time to investigate further and provide a more detailed report.

error in the step 'lastz

Hi,

I have been running make_lastz_chain with several species (all closely related sps, >50MYA divergence) and it works great. However, for only 3 species I have had an error in the last jobes (>90% completed) of step 'lastz'. I was wondering if you could identify the error in these particular sps. (I am running all in def.)

g_multiradiatus_picta.txt

Thanks in advance,
Diego

unable to run make lastz chains on slurm cluster- where do I edit config file?

Hello! I am trying to make genome alignment chains using your pipeline, but I keep getting an error related to the nextflow submission process. It fails to submit the joblist that it produces, giving me this error message:

N E X T F L O W ~ version 23.04.0 Launching.../TOGA/make_lastz_chains/parallelization/execute_joblist.nf`
ERROR ~ Error executing process > 'execute_jobs (12)'

Caused by:
Failed to submit process to grid scheduler for execution

Command executed:

sbatch .command.run

Command exit status:
1

Command output:
sbatch: error: invalid partition specified: batch
sbatch: error: Batch job submission failed: Invalid partition name specified`

I looked into the execute_joblist.nf file, and it calls a config file for submissions. I believe editing the partition in the config file will probably fix this process, but I am not sure where to find it. Any help with this would be much appreciated!

Details about input data:
I softmasked the genomes (around 30-40% of genome is soft-masked), and then turned them into 2bit files using faToTwoBit.

Error at fillChains step

Hi @kirilenkobm,

Since the response of Michael in a previous issue, I successfully masked 3 dog genomes available (35-43%) and the lastz step was done within the expected time (~ 3 days on a 32-core machine). Sadly, I am stuck at the fillChains step with various error exit status. I started running make_lastz_chains with ROS_Cfam_1.0 (NCBI reference; 43% masked) against hg38 as follows:

./make_chains.py "human" "dog" "hg38_masked.fa" "ROS_Cfam_1.0.masked.fa" --queue_executor 32 --project_dir test

I re-run the job like 3 times after the first error with --continue_arg "fillChains" .

This is my log file: make_chains.log

Any advice?

Regards,
Alejandro

Fragmented genomes

Hi everyone,

Do you think I can run make lastz chains in fragmented genomes? One has 757 scaffolds and the other has 60,750. I already used RAGTAG to improve the assembly and they are already masked. I am using a 32 CPU server to run the analysis and I am planning to run TOGA too.

Thanks in advance

Pipeline failed at last clean_chains step

Hello, got an error at last Clean chains step: ### Clean Chains Step ###

Chains were filled: using /users/hpctestuser/mprylutskyi/makechains/make_lastz_chains/deschrambler_test_files/temp_chain_run/human.cow.filled.chain.gz as input
Chain to be cleaned saved to: /users/hpctestuser/mprylutskyi/makechains/make_lastz_chains/deschrambler_test_files/temp_chain_run/human.cow.before_cleaning.chain.gz
An error occurred while executing clean_chains: 'str' object has no attribute 'removesuffix'

I understand that script has some problems with .gz suffix in chain file, but I do not understand how to solve it. Here is server log Chains were filled: using /users/hpctestuser/mprylutskyi/makechains/make_lastz_chains/deschrambler_test_files/temp_chain_run/human.cow.filled.chain.gz as input
Chain to be cleaned saved to: /users/hpctestuser/mprylutskyi/makechains/make_lastz_chains/deschrambler_test_files/temp_chain_run/human.cow.before_cleaning.chain.gz
An error occurred while executing clean_chains: 'str' object has no attribute 'removesuffix'
Traceback (most recent call last):
File "/users/hpctestuser/mprylutskyi/makechains/make_lastz_chains/modules/step_manager.py", line 70, in execute_steps
step_result = step_to_function[step](params, project_paths, step_executables)
File "/users/hpctestuser/mprylutskyi/makechains/make_lastz_chains/modules/pipeline_steps.py", line 88, in clean_chains_step
do_chains_clean(params, project_paths, executables)
File "/users/hpctestuser/mprylutskyi/makechains/make_lastz_chains/steps_implementations/clean_chain_step.py", line 31, in do_chains_clean
_output_chain = input_chain.removesuffix(".gz")
AttributeError: 'str' object has no attribute 'removesuffix'

SGE job completed on Wed Nov 29 17:47:50 GMT 2023

I would be grateful for any help

TODO: add param to customise queue parameter

Inspired by:
hillerlab/TOGA#109

Running local using multiple threads (CPUs)?

Hello, @kirilenkobm and @MichaelHiller,
Thanks for creating this tool - I was looking for an option to run lastz genome-to-genome alignment on multiple threads and ended up here.

I was submitting a job to a single SGE node that has 16 CPUs. Since I am using only one machine and not going to qsub from the node, I used the default ( --executor local ) option (--executor sge switched back to local anyway when it realized qsub is not allowed in the node).

Nextflow documentation says

The local executor is used by default. It runs the pipeline processes in the computer where Nextflow is launched. The processes are parallelised by spawning multiple threads and by taking advantage of multi-cores architecture provided by the CPU

But looking at the temp. folders and files during the run, I am not sure whether all available threads are being used. It seems like the query sequence is divided to 4 chunks (this might be decided by the chunk size settings rather than number of threads, I guess) and being aligned to the target. But the 4 chunks appears to be aligned one after another, rather than in parallel.

...

The following was the command line I used:

make_chains.py Dme Dsi Dme_genomic.fna Dsi_genomic.fna --project_dir DmeDsi_1stAttempt

I also tried to convey that there are 16 CPUs, but this ran exactly the same way (four chunks of query being processed one after another):

make_chains.py Dme Dsi Dme_genomic.fna Dsi_genomic.fna --executor local --cluster_parameters "cpus=16" --project_dir DmeDsi_1stAttempt2

Is there a way to run this pipeline on multiple threads in parallel on a local machine? Or is each of the four chunks somehow already being aligned using all available CPU threads?

Thanks again!

hillerlab / make_lastz_chains Goto Github PK

make_lastz_chains's People

Contributors

Stargazers

Watchers

Forkers

make_lastz_chains's Issues

Lastz Alignment Step

Nextflow process lastz finished successfully

Concatenating Lastz Results (Cat) Step

Concatenating Lastz Results (Cat) Step

SGE job completed on Wed Nov 29 17:47:50 GMT 2023

Recommend Projects

Recommend Topics

Recommend Org