
chadlaing / panseq


Pan-genomic sequence analysis

Home Page: http://lfz.corefacility.ca/panseq

License: GNU General Public License v3.0

Perl 0.66% CSS 0.01% HTML 99.33%
pan-genome genome-sequencing snps accessory-genome core-genome comparative-genomics

panseq's Introduction


OVERVIEW

Panseq determines the core and accessory regions among a collection of genomic sequences based on user-defined parameters. It readily extracts regions unique to a genome or group of genomes, identifies SNPs within shared core genomic regions, and constructs files for use in phylogeny programs based on both the presence/absence of accessory regions and SNPs within core regions.

It also provides a loci selector that efficiently computes the most discriminatory loci from a tab-delimited dataset.

If you find Panseq useful please cite:

Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions. Laing C, Buchanan C, Taboada EN, Zhang Y, Kropinski A, Villegas A, Thomas JE, Gannon VP. BMC Bioinformatics. 2010 Sep 15;11:461.

USAGE

The Panseq standalone script can be accessed from:

lib/panseq.pl

The loci finder can be accessed from:

lib/loci_selector.pl

SETUP

Panseq requires Perl 5.10 or greater, and the following CPAN package to be installed:

Module::Build

This package can be installed from the command line using cpan -i Module::Build. After that, run the following to install Panseq and automatically retrieve the CPAN packages it requires:

perl Build.PL
./Build installdeps

The following free, external programs must also be installed: MUMmer (which provides nucmer), BLAST+ and Muscle. cd-hit is optional.

Testing your installation

perl t/output.t

This will run a test suite against the included test data to ensure that Panseq is configured and working correctly. All tests should pass. The cd-hit tests will only be run if cd-hit is found. Panseq checks for the installed programs on the local path, but they can be optionally specified as follows:

perl t/output.t --blastDirectory '/home/user/local_blast/' --mummerDirectory '/home/user/local_mummer/' --cdhitDirectory '/home/user/cdhit' --muscleExecutable '/home/dir/muscle_executable'
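Before running the test suite, it can help to confirm which of the external programs are visible on the local path. The following is a minimal Python sketch, not part of Panseq. The executable names are assumptions based on the programs named in this README; a given installation may use different names (the example configuration further down points at a versioned muscle binary, for instance).

```python
import shutil

# Executable names are assumptions taken from the programs this README mentions;
# adjust them to match the local installation.
REQUIRED_PROGRAMS = ("nucmer", "blastn", "makeblastdb", "muscle")
OPTIONAL_PROGRAMS = ("cd-hit-est",)


def find_programs(names):
    """Map each program name to its full path on $PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


def missing_programs(names):
    """Return the sorted names of programs that could not be found on $PATH."""
    return sorted(name for name, path in find_programs(names).items() if path is None)
```

Calling `missing_programs(REQUIRED_PROGRAMS)` returns an empty list when everything is found; any name it does return would need to be supplied through the corresponding directory or executable option.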

Running Panseq

All the adjustments to Panseq are made by modifying a tab-delimited configuration file, which is specified as the only argument to the script.

perl lib/panseq.pl settings.txt

Below is an example configuration file for panseq.pl:

queryDirectory	/home/phac/panseq/queryLarge/
referenceDirectory	/home/phac/panseq/referenceLarge/
baseDirectory	/home/phac/panseq/output/panseq2test/
numberOfCores	22
mummerDirectory	/home/phac/MUMmer3.23/
blastDirectory	/home/phac/ncbi-blast-2.2.29+/bin/
minimumNovelRegionSize	500
novelRegionFinderMode	no_duplicates
muscleExecutable	/usr/bin/muscle3.8.31_i86linux64
fragmentationSize	500
percentIdentityCutoff	85
coreGenomeThreshold	2
runMode	pan

Advanced Options

queryFile	/home/phac/fileOfQuerySequence.fasta
cdhitDirectory	/home/phac/cd-hit/
storeAlleles	1
allelesToKeep	2
nameOrId	name
frameshift	1
overwrite	1
maxNumberResultsInMemory	500
blastWordSize	11
nucB	200
nucC	65
nucD	0.12
nucG	90
nucL	20
cdhit	1
sha1	1

Settings and their [DEFAULTS]

  • queryDirectory [REQUIRED] should contain the full directory path of the folder where all of the query sequences you are interested in comparing reside. Panseq will use the entire contents of this folder.

  • baseDirectory [REQUIRED] is the directory where all the output from Panseq is placed, and should be the full directory path.

  • runMode [REQUIRED] can be either novel or pan, for novel-region finding and pan-genome analyses respectively.

  • referenceDirectory [OPTIONAL] should contain the full directory path of the folder where all of the reference sequences you are interested in comparing reside. During the identification of novel regions step, these reference sequences will be screened out.

  • queryFile [OPTIONAL] is an input file of fasta-formatted sequences. If the mode is set to pan, the sequences in the queryFile will be used instead of generating a pan-genome. Thus, the distribution and SNPs of the input sequences will be determined for all genomes in the queryDirectory. This can be useful, for example, for quickly generating a table of + / - values for a set of input genes against a set of genomes in the queryDirectory.

  • numberOfCores [1] sets the number of processors available to Panseq. Increasing this can reduce run times.

  • mummerDirectory [$PATH] specifies the full path to the folder containing the nucmer program.

  • blastDirectory [$PATH] specifies the full path to the blast+ bin directory.

  • cdhitDirectory [$PATH] specifies the full path to the cd-hit program directory.

  • muscleExecutable [$PATH] specifies the full path to the muscle executable file.

  • minimumNovelRegionSize [0] sets the size in bp of the smallest region that will be kept by the Novel Region Finder; all regions found below this value will not be kept.

  • fragmentationSize [0] when running in mode pan determines the size of the fragments that the genomic sequences are segmented into. When set to 0, no fragmentation of the input is done, which can be useful if specifying input via the queryFile option.

  • percentIdentityCutoff [85] when running in mode pan sets the threshold of sequence identity for determining whether a fragment is part of the core or accessory genome.

  • coreGenomeThreshold [3] defines the number of input sequences that a segment must be found in to be considered part of the core genome; multi-fasta files of a single genome are treated as a single sequence.

  • storeAlleles [0] if set to 1, will store the allele matching the query sequence for each of the genomes and output them to locus_alleles.txt.

  • allelesToKeep [1] if set, and if storeAlleles is set, determines the number of alleles per genome to keep, if multiple exist. They will be output to the locus_alleles.txt file, and every allele after the first will be appended with a _a# tag, where # is the allele number (eg. _a2).

  • nameOrId [id] determines whether the individual locus ID string of numbers is output, or the name based on the query sequence in the files binary_table.txt and snp_table.txt.

  • frameshift [0] if set to 1, includes frameshift-only (gap-only) differences in SNP counts. The default behavior is to include only positions where there are also nucleotide differences.

  • overwrite [0] determines whether or not the specified baseDirectory will be overwritten if it already exists. This will cause all data in the existing directory to be lost.

  • maxNumberResultsInMemory [500] sets the number of pan-genome results to process before emptying the memory buffers and printing to file. Set this number higher if you want to limit the number of I/O operations. If you run into memory issues, lower this number.

  • blastWordSize [20] sets the word size for the blastn portion of Panseq. For small values of fragmentationSize or percentIdentityCutoff, hits may be missed unless this value is lowered. (The default value for the blastn program is 11; Panseq sets this to 20 as the default).

  • nucB [200] sets the b parameter when running nucmer.

  • nucC [65] sets the c parameter when running nucmer.

  • nucD [0.12] sets the d parameter when running nucmer.

  • nucG [90] sets the g parameter when running nucmer.

  • nucL [20] sets the l parameter when running nucmer.

  • cdhit [0] determines whether or not cd-hit-est is run on the pan-genome before identifying the distribution of the pan-genome (and SNPs among core regions) among the input sequences. Percent identity cutoff for cd-hit-est is taken from percentIdentityCutoff.

  • sha1 [0] sets the header and ID for all analyses as the SHA1 hash of the sequence. In novel mode this will give the novel regions with the fasta header as the hash. In pan mode this will also set the fasta headers as the SHA1 hash of the sequence, and the ID column in the outputs will use the SHA1 hash.
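The settings above can be read with a few lines of code. The following is a minimal Python sketch, not Panseq's own parser: it reads tab-delimited key/value lines, fills in the defaults listed above, and checks the three required settings. The function name `parse_settings` and the error messages are hypothetical, and values are kept as strings.

```python
# Defaults copied from the settings list in this README; values kept as strings.
DEFAULTS = {
    "numberOfCores": "1", "minimumNovelRegionSize": "0", "fragmentationSize": "0",
    "percentIdentityCutoff": "85", "coreGenomeThreshold": "3", "storeAlleles": "0",
    "allelesToKeep": "1", "nameOrId": "id", "frameshift": "0", "overwrite": "0",
    "maxNumberResultsInMemory": "500", "blastWordSize": "20",
    "nucB": "200", "nucC": "65", "nucD": "0.12", "nucG": "90", "nucL": "20",
    "cdhit": "0", "sha1": "0",
}
REQUIRED = ("queryDirectory", "baseDirectory", "runMode")


def parse_settings(text):
    """Parse tab-delimited 'key<TAB>value' lines (hypothetical helper)."""
    settings = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first tab only; stray spaces around the key are dropped.
        key, _, value = line.partition("\t")
        settings[key.strip()] = value.strip()
    absent = [key for key in REQUIRED if not settings.get(key)]
    if absent:
        raise ValueError("missing required settings: " + ", ".join(absent))
    if settings["runMode"] not in ("novel", "pan"):
        raise ValueError("runMode must be 'novel' or 'pan'")
    return settings
```

Because the sketch splits strictly on tabs, a line that separates key and value with spaces ends up with an empty value, which is one way to catch a configuration file that is not truly tab-delimited.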

Format of multi-fasta files

Panseq currently only accepts fasta or multi-fasta formatted files. More than one genome may be in a single file, but for all genomes consisting of more than one contig, a distinct identifier must be present in the fasta header of each contig belonging to the same genome. For example, you have just assembled a new genome and are eager to analyze it. Your file consists of a number of contigs, similar to:

>contig000001
ACTGTTT...

>contig000002
CGGGATT...

The unique identifier could be the strain name or anything else of your choosing, but it must be included using the "local" designation: lcl|unique_identifier|. To reformat the above contigs, find and replace all ">" characters in your multi-fasta file with >lcl|unique_identifier|. Thus, if the unique identifier were "strain1", the reformatted contigs would look as follows:

>lcl|strain1|contig000001
ACTGTTT...

>lcl|strain1|contig000002
CGGGATT...

Common database file formats are supported by default, such as ref|, gb|, emb|, dbj|, and gi| and do not need to be modified as described above. For legacy purposes, the name=|unique_identifier| is supported in addition to lcl|unique_identifier|. Please note that spaces are not permitted in the unique identifier. Only letters (A-Z, a-z), digits (0-9) and the underscore "_" are valid characters.
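The find-and-replace step described above can also be scripted. The following is a minimal Python sketch, not part of Panseq: it prefixes every fasta header with lcl|identifier| and rejects identifiers containing anything other than the characters listed above. The function name `add_local_identifier` is hypothetical.

```python
import re

# Only letters, digits and the underscore are valid, per this README.
VALID_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def add_local_identifier(fasta_text, identifier):
    """Prefix each fasta header with 'lcl|identifier|' (hypothetical helper)."""
    if not VALID_IDENTIFIER.fullmatch(identifier):
        raise ValueError("identifier may contain only A-Z, a-z, 0-9 and '_'")
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Keep the original header text after the new prefix.
            line = ">lcl|" + identifier + "|" + line[1:]
        out.append(line)
    return "\n".join(out) + "\n"
```

Only lines that start with ">" are changed, so sequence lines pass through untouched.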

Description of output files

  • accessoryGenomeFragments.fasta: based on the run settings, all pan-genome fragments that are considered "accessory".
  • binary.phylip: the presence / absence of the pan-genome among all genomes in the queryDirectory in phylip format.
  • binary_table.txt: the presence / absence of the pan-genome among all genomes in the queryDirectory in tab-delimited table format.
  • core_snps.txt: based on the run settings, a tab-delimited, detailed results file of all SNPs found. Includes genome name, contig name, nucleotide variant, and base-pair position.
  • coreGenomeFragments.fasta: based on the run settings, all pan-genome fragments that are considered "core".
  • Master.log: the log detailing program execution.
  • pan_genome.txt: based on the run settings, a tab-delimited, detailed results file of all pan-genome regions. Includes genome name, contig name, presence / absence, and base-pair position for the pan-genome regions.
  • panGenome.fasta: the non-fragmented pan-genome for the genomes in queryDirectory.
  • panGenomeFragments.fasta: the fragmented pan-genome based on the fragmentationSize parameter.
  • phylip_name_conversion.txt: the genomes in the phylip file are labeled as sequential numbers. This file maps the numbers back to the original names given in the input fasta files. Can be used by the lib/treeNumberToName.pl script to automatically convert a newick file labeled with numbers to the appropriate genome names.
  • snp.phylip: a concatenated alignment of all SNPs found in the "core" genome regions for all genomes in the queryDirectory.
  • snp_table.txt: the nucleotide values for all SNPs found in the "core" genome regions in tab-delimited table format.
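To illustrate how the number-to-name mapping in phylip_name_conversion.txt relates to a tree, the following is a minimal Python sketch that relabels the leaves of a newick string. It is not the lib/treeNumberToName.pl script and makes no claim about how that script works. It assumes the conversion file has two whitespace-separated columns (number, then name) with a header row, and that numeric leaf labels directly follow "(" or "," with no spaces. The function names are hypothetical.

```python
import re

# A leaf label is a run of digits right after '(' or ',' and right before ':', ',' or ')'.
# Branch lengths (after ':') and internal node labels (after ')') are not matched.
LEAF_LABEL = re.compile(r"(?<=[(,])\d+(?=[:,)])")


def read_name_table(text):
    """Read 'number name' rows, skipping the header and malformed lines."""
    names = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit():
            names[fields[0]] = fields[1]
    return names


def rename_leaves(newick, names):
    """Replace numeric leaf labels with genome names; unknown numbers are left as-is."""
    return LEAF_LABEL.sub(lambda m: names.get(m.group(0), m.group(0)), newick)
```

Note that the input to relabel is a newick tree produced by a phylogeny program from the phylip file, not the phylip alignment itself.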

Detailed explanation of Panseq

Novel Region Finder

The Novel Region Finder currently has two modes implemented: "no_duplicates" and "unique". The no_duplicates mode identifies any genomic regions present in any of the query sequences that are not present in any of the reference sequences, and returns these regions in multi-fasta format. The "unique" mode finds genomic regions that are unique to each of the query sequences and not present in any of the reference sequences.

These comparisons are done using the nucmer program from MUMmer 3, the parameters of which can be adjusted by the user prior to submitting an analysis. A sample of the output from novelRegions.fasta looks as follows:

>lcl|strain1|contig000001_(505..54222)
CCGTACGGGATTA...

The header lists the name of the contig containing the novel region, followed by the nucleotide positions that were determined to be "novel" based on the comparison run. The corresponding novel nucleotide sequence follows on the next line.

Core / Accessory Analysis

All of the query strains are used to determine a non-redundant pan-genome. This is done by choosing a seed sequence for the pan-genome and iteratively building the pan-genome by comparing non-seed sequences to the "pan-genome", using the Novel Region Finder described above. For each comparison, sequences not present in the "pan-genome" are added, and the expanded "pan-genome" is used for the comparison against the next sequence. This iteration continues until a non-redundant pan-genome has been constructed.

Following the creation of this pan-genome for the selected sequences, the pan-genome is segmented into fragments of user-defined size. These fragments are subsequently queried against all of the sequences in the query list using blastn. The presence or absence of each pan-genome fragment is determined for each query, based on the Sequence Identity threshold set by the user. Pan-genome fragments present in a minimum number of genomes (determined by the core genome threshold) are aligned using Muscle.
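The core / accessory split described above comes down to counting, for each fragment, the genomes in which it was found. The following is a minimal Python sketch of that counting step only, assuming the presence / absence calls have already been made. It is not Panseq's implementation; the function name and the input layout (a dict of fragment to a dict of genome to 0 or 1) are hypothetical.

```python
def classify_fragments(presence, core_threshold):
    """Split fragments into (core, accessory) lists.

    presence: {fragment_name: {genome_name: 0 or 1}}  (hypothetical layout)
    A fragment counts as core when it is present in at least
    core_threshold genomes, mirroring coreGenomeThreshold.
    """
    core, accessory = [], []
    for fragment, row in presence.items():
        n_present = sum(1 for value in row.values() if value)
        if n_present >= core_threshold:
            core.append(fragment)
        else:
            accessory.append(fragment)
    return core, accessory
```

With three genomes and a threshold of 2, a fragment found in all three is core and a fragment found in only one is accessory.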

Single nucleotide polymorphisms (SNPs) in these alignments are determined and used to generate a Phylip formatted file of all SNPs for use in downstream phylogenetic analyses. A phylip formatted file of pan-genome fragment presence / absence is also created, as are tab-delimited tables for both SNP and pan-genome fragment presence / absence (snp_table.txt and binary_table.txt, respectively), and a detailed result file listing the names, positions and values of the SNP and binary data (core_snps.txt and pan_genome.txt).
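As an illustration of SNP calling on an aligned core fragment, the following is a minimal Python sketch that reports the alignment columns containing more than one nucleotide. It reflects one reading of the frameshift setting: by default a column counts only if at least two different nucleotides occur in it, and with frameshift enabled a column where the only difference is a gap against a nucleotide also counts. This is an assumption, not Panseq's code, and the function name is hypothetical.

```python
NUCLEOTIDES = set("ACGT")


def snp_columns(alignment, frameshift=False):
    """Return 0-based indexes of variable columns in an alignment.

    alignment: {genome_name: aligned_sequence}, all sequences the same length.
    """
    sequences = [seq.upper() for seq in alignment.values()]
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    columns = []
    for i in range(length):
        column = {seq[i] for seq in sequences}
        bases = column & NUCLEOTIDES
        if len(bases) > 1:
            # At least two different nucleotides: always a SNP.
            columns.append(i)
        elif frameshift and "-" in column and bases:
            # Gap versus a single nucleotide: counted only when requested.
            columns.append(i)
    return columns
```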

The detailed results file provides data in six columns: Locus Id, Locus Name, Genome, Allele, Start bp and Contig. Locus Id is a 10-digit number that uniquely identifies the locus; this number is used in the tabular output for both the SNP and binary data, and can be used for cross-referencing. Locus Name and Genome provide the human-readable names for the locus and the genome. The Allele column lists the actual data for the comparison. For pan-genome fragment presence / absence this is binary "0" or "1" data. For the SNP table, this is "A", "C", "T", "G" or "-". Start bp refers to the nucleotide position of the locus in base pairs, for example "45933"; for the binary data, the position is that of the start of the fragment. Lastly, the Contig column lists the name of the contig the locus is found on, which may differ from the Genome column for genomes supplied as multi-fasta files.

Running the loci selector

The loci selector takes two command line arguments, with an optional third:

perl loci_selector.pl input_file number_of_loci maximize_pod > output_file

The number_of_loci can be any positive integer, or 'best', which will find the minimum number of loci to provide a unique fingerprint for each data column, if possible. The default value of maximize_pod is 0, but can be set to 1 to disable masking of previously used locus pairs when calculating the points of discrimination. The default is recommended. Output is to STDOUT, and can be redirected to output_file.

Detailed explanation of loci selector

The loci selector constructs loci sets that are maximized with respect to the unique number of fingerprints produced among the input sequences as well as the discriminatory power of the loci among the input sequences. The final loci set is iteratively built, in the following steps, given a tab-delimited table with loci names in the first column, sequence names in the first row, and single character data filling the matrix. Missing data is denoted by the characters '?', '-', or '.' :

  • (1) Each potential available locus is evaluated for the number of unique fingerprints that would result from its addition to the final loci set. All loci that would generate the maximum number of unique fingerprints in this respect are evaluated in step (2).

  • (2) All loci from step (1) are evaluated for their discriminatory power among the sequences, which is given as points of discrimination (POD). The POD for a locus is calculated as follows. A listing of all possible pair-wise comparisons is constructed; for example, if the input table consisted of three sequences, A, B and C, the list would consist of A-B, A-C and B-C. Next, it is determined whether or not the sequences in each pair-wise comparison contain the same single character denoting the locus state. If they do, a value of 0 is assigned; if they differ a value of 1 is assigned. The POD is then the summation of all pair-wise comparisons that differ for that locus. With our previous example, if A-B = 1, A-C = 1 and B-C = 0, the POD for that locus would be 2.

  • (3) The locus with the highest value from step (2) is selected for addition to the final loci set and removed from the pool of candidate loci. If two or more loci tie in value, one is randomly selected. If all possible unique fingerprints have been found, the algorithm continues with (4); if additional unique fingerprints are possible, the algorithm continues with (5).

  • (4) Sequence pairs for which the allele of the locus chosen in (3) differ are temporarily excluded from the analysis ("masked"). This ensures loci that differ between other pairs of strains are preferentially considered. Consider our A, B and C example with pair-wise comparisons of A-B = 1, A-C = 1 and B-C = 0. In the case of this locus being chosen, the sequence pairs A-B and A-C would be temporarily removed from the analysis ("masked"), leaving only loci that differed between B-C as viable options. Setting maximize_pod to 1 prevents this masking step, which can be useful if one is only interested in loci that offer the most discrimination, regardless of which sequence pairs offer that discrimination.

  • (5) Once a locus has been chosen:

    • a) the specified number of loci has been reached (all unique fingerprints in the case of 'best') and the algorithm terminates; or

    • b) the specified number of loci has not been reached and there are remaining fingerprints possible, or sequence pairs for which differences exist. The algorithm returns to (1); or

    • c) there are no remaining fingerprints possible and no sequence pairs for which differences exist. At such time, all sequence pairs are again considered part of the analysis ("unmasked"). If no differences among any sequence pairs exist at this point, the algorithm terminates; if differences remain, the algorithm returns to (1).
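Steps (1) to (3) can be sketched in a few lines. The following Python sketch computes the number of unique fingerprints and the POD for a locus, then picks the next locus by maximizing fingerprints first and POD second. It is an illustration under stated assumptions, not the loci_selector.pl implementation: pairs in which either sequence has missing data are assumed not to count toward the POD, missing characters are treated as ordinary states when counting fingerprints, and ties are broken by taking the first candidate rather than a random one. All function names are hypothetical.

```python
from itertools import combinations

MISSING = {"?", "-", "."}


def points_of_discrimination(states, masked=frozenset()):
    """Count sequence pairs whose states differ at one locus.

    states: one single-character state per sequence, in column order.
    masked: set of (i, j) index pairs, i < j, to skip (step 4 masking).
    """
    pod = 0
    for i, j in combinations(range(len(states)), 2):
        if (i, j) in masked:
            continue
        a, b = states[i], states[j]
        if a in MISSING or b in MISSING:
            continue  # assumption: missing data never discriminates
        if a != b:
            pod += 1
    return pod


def fingerprint_count(rows):
    """Number of distinct per-sequence fingerprints across the given loci rows."""
    return len(set(zip(*rows)))


def select_next(candidates, chosen_rows, masked=frozenset()):
    """Pick the candidate giving the most fingerprints, then the highest POD."""
    def score(name):
        row = candidates[name]
        return (fingerprint_count(chosen_rows + [row]),
                points_of_discrimination(row, masked))
    return max(candidates, key=score)
```

For the three-sequence example in step (2), the states "1", "0", "0" for A, B and C give A-B = 1, A-C = 1 and B-C = 0, so `points_of_discrimination(["1", "0", "0"])` is 2.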

panseq's People

Contributors

chadlaing


panseq's Issues

# Genomes

Hello,

I hope you all are doing well.

I am trying to run >1,000 genomes using panseq and have adapted the contig headers to the style listed in the README file, but I still seem to be getting >40,000 genomes for the run. I have listed an example of the headers found in one of my fasta files below. Any help would be greatly appreciated. Thanks!

$ grep ">" GCA_003398285.1_ASM339828v1_genomic_clean.fna | head

>lcl|GCA_003398285.1_ASM339828v1_genomic|contig1
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig2
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig3
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig4
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig5
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig6
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig7
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig8
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig9
>lcl|GCA_003398285.1_ASM339828v1_genomic|contig10

Best,
Chris

MakeBlastDB requires

Hi!

Panseq cuts out for me citing:

2014/04/03 12:32:00 INFO | SegmentMaker.pm:155> Segmenting /pub19/matthew/Panseq/Panseq-master/BasenucmerTempFile2ThuApr31230132014493417855614_pan_final_novelRegions_final_novelRegions into 500bp segments
2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57> Modules::Alignment::MakeBlastDB requires

2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57>

2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57> 'dbtype'

2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57>

2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57> 'out'

2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57>

2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57> 'title'

2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57>

2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57> 'in'

2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57>

2014/04/03 12:32:01 FATAL | MakeBlastDB.pm:57> 'logfile'

I've checked the scripts and can't discern any issue and the blast directory is correct and makeblastdb is working.

Many, many thanks in advance for any help with this!!

Best regards,

Matt Moore.

loci_selector.pl

Hello

While running loci_selector.pl I get a bunch of warnings.

It was run with pan_genome_2.3.0.txt from the test data.

Use of uninitialized value in concatenation (.) or string at /local/gensoft2/exe/Panseq/3.2.1/lib/perl5/Modules/LociSelector/LociSelector.pm line 483.

and

Use of uninitialized value in concatenation (.) or string at /local/gensoft2/exe/Panseq/3.2.1/lib/perl5/Modules/LociSelector/LociSelector.pm line 250

and

Argument "best" isn't numeric in numeric eq (==) at /local/gensoft2/exe/Panseq/3.2.1/lib/perl5/Modules/LociSelector/LociSelector.pm line 200.

regards

Eric

Add LocusName to snp_table

Hi Chad,

Could an option be added whereby the LocusName is added to the snp_table.txt matrix? I'm attempting to analyse snps per fragment but the core_snps.txt can become massive!

The output I have in mind would look like this:

                   genome1  genome2  genome3  genome4
LocusName1  4567   A        T        C        G
LocusName1  10384  A        T        C        G
LocusName2  12475  A        T        -        G
LocusName2  14569  -        G        C        G

Thanks in advance for any help with this!

EDIT: I realise now that the positions are not included already. So it would be to add the LocusName's and positions. For my purposes the LocusName's alone would be sufficient

Multifasta input option

Hi! Not so much an issue but a preference I'd like to inquire about:

Because the software accepts multi-fasta files that either each represent a single genome or contain a number of genomes, and these are combined at the start of the pipeline, the fasta headers need to be quite different.

I'm working with a lot of files and a lot of the identifiers are very similar, recently 213 genomes were thought to be 50 for example. Is there a way to disable this function?

Thanks for your time!

Matt.

Error while using treeNumberToName.pl not able to match the values in snp.phylip and phylip_name_conversion.txt

I ran the following perl script
perl /apps/eb/Panseq/3.2.1-foss-2019b-Perl-5.30.0/lib/treeNumberToName.pl out/snp.phylip out/phylip_name_conversion.txt > out_tree
Gives the following error
Could not match 1
Could not match 2
Could not match 3
Could not match 4
Could not match 5
Could not match 6
Could not match 7
Could not match 8
Could not match 9
Could not match 10

less phylip_name_conversion.txt
Number Name
1 PANS_1_10
2 PANS_1_2
3 PANS_1_5
4 PANS_1_6
5 PANS_1_8
6 PANS_1_9
7 PANS_200_1
8 PANS_200_2
9 PANS_2_1
10 PANS_2_5

less snp.phylip
81 2112
1 CGGGACTGCGTTCTTGCAGTACAAGGAAGGACGGGCGCAACCGGTGCTGTCAATATCACAGGTCTCCTTTAGATC--TACGCGACCGGAACAGAACAGAGCAGCCTACGACAGCTAATAAGTGTGGTGCCACATCCGGTCGCGTATATGTCGCTGGGCTAGATGGTCCACCTATGGAGATCCCGCAAAAATTTTGGGACCCCGCTCGGTGAAACTTTCTTCAAACGGGACTACGGTGACGGCGCGGGTCGATAAATGCGCTATTCCGTTTTCTCCTCTCAGCGTCATTATCTTTCCACTGATTTGCGCCCGGGGATGAGCCATCGACGCGCGTCTTCGCGTACACAAGTCGAAGGTAACAACTTAGCTAGAATAACCGAAATGCACAAGTCTGGCTGGTAGTACCACCTCACTTTCCCCTATATCTCTATTTACTTCAATCGCCTCCGCCTTTGGATTCCCCTGTGCTATCTTACAGGAGACGGGTATCTGAAGGGATGCGGCTCGGTTTTGCCTGTGCCCGGGACCTAGTTCTGGCGGAACGCTTTCGTTGCTTGGGTACCCTGCACACTGACTTCC-CACGCTGAGCGATAGTAATGGTTAGACGGGGAAGGTAAAACAGACACAGAGAAGGGGCTCAATGCTTGGCGTCGCCTTCCCTGTTGCAAGTCGCGGATAGTCACGAATAATCGAGAGCTTGGATAACGAAAGATGGTTACCGAGGCAGTCAACCTCGATCCGTCATCGTCGACACATTTCCCCGGAAGCGACCACATGTTGATCCCGACTGTGGGTACAAGGCGGGTATGCAGTCCCCGATCTATGCCAGAAGATGTTTGCCGTCCGTCTGCCACTATATTCCCGGCTCCCATGCTGGTACAACCCCCTGATTTCTACAGCCACATGGACAGTTTGACCTAGAAACTGGAGCCGTTAAGCGGCTCGACCTTCTATTCGAACGCGTGCTGTACGGGAATTGCATATCTCTGAGGTGACGCAGTCCAGTTCGGTTCGTCAGTTTGTGGCTCATGTCTCCCGCAGGCGACTGGTTGCTTCAAGGACGTACGAGTTATAATATCGC-CTACAGCGTTTGAAGTTTCCTGAACCCCGCCGCGGGTCTATTTTCAGACACGCTCCCCGACGCGAGTACGGCAAGCGATGCGCTCATCCGTCTGACGCTGCCAACCCCCCCCCGTCGG--TATAAATGCTAGGTTAAGAGACCAAGATACAAGATGGACCTGGGAACACTGCTCCTCCCCGCCTAATTTTCGCCCTAATCCTGTCTTCTGGTCCTGCCCAGTTATTATAC-ACCCGTTGGCTTATGTCGCGCATTGCATAACGATCTCACAGCTACCCCGACGAGCCGAACGGCGTTCACTTGCTTGTATTCGCGGATAGTGTTGAAAAGCTACCCCTGGCGCCTATCCCGAGTCGTAGCCAATAGGCGGGCGCAACCGGATACTTTCTAGTGCGGATTGTACGCGACCGGTCACGCTCGGTCCCAACCGCCTCAGTCATTCTCATAGCGCTCGTGCCTCCGTTCTTCTGCATTCCTCATCGGATCCTGGTGAATGTCCCAATCCGCATTCTCTCGGTCCGCCGAGATCTCTGGCTGGCCGGGACGTCCGGGGGGAGCCCCATGTCGGCCGCCTAGGGGCCGGTCCGTGCCAATGCGACAAATCCCTGCGCGATCCAGGGCCGGCGCACACCACAACGCGGTCTCCCGCCGCGGCCCCCGTACTGCCGAGCGGCTGGGGGTCGTGGTGGACGAACGCGGCTGCCCCCACGAGGCTTTACTGGTGCGAGTCGGCCTCCCGATAGAAGGTCTGGCGTACAGTGAAAAAGAGAGGGGGAGCTTCTCCACGAAGCAGGGGAGAAGGGGAGTAGCGTGTGTAGGGTACTCGTCGGGAGAGCGGGGGGCGCGCGGGGGGACGCGAGAGGTGGCAGAATGCTCCGGCCCCCGGGGGCGGGCTAGCGAAAGGGAGCCTCGCGCG
GTTGCAATTTTCGAGCTCTCGATATACGGTCTATCTTTCGAAAAGGGAAAAATTGGGTTTATGGCTCCTTTAATATATAAAAAAAGACATAAATAAACAAACGAACTCGCAAAC
2 TGGGACTGCGTTCTTGCAGTACAAGGAAGGACGTACGCAACCGGAGCTGCCAATATCACAGGTCTCCTTTAGATC--TACGCAACCGGAACAGAACAGAGCAGCCTACGACAGCTAATAAGTGTGGTGCCACATCCGGTTGCGTATGCATCGCTGGGCTAGATGGTCCACCTATGGAGATCCCGCAAAAATTTTGGGACCCCGCTCGGTGAAACTTTCTTCAAACGGGACTACGGTGACGGCGCGGATCGATACATGCGCTATTCCGTTTTCTTCTCTCAGCGTCATTATCTTTCCACCGATTTGCGTACGGGGATGAGTCATCGACGCGCGTCTCTGCGTACACAAGTAGAAGGTAACAACTTAGCTAGAATAACCGAAATGCACAAGTCTGGCTGGTAGTACCACCTCACTTTCCCCTATATCTCTATTTACTTCAATCGCCTCCGCCTTTGGATTCCCCTGTGCCGTCTTACAGGAGACAGGTATCTGAAGGGATGCGGCTCGGCTTTCCCTGTGCCCGGGACCTAGCTCTGGCGGAACTCTTCCGTTGCTTGGGTACCCTGCACACTGACTTCT-CACGCTGAGCGATAGTAATGGTTAGACGGGGAAGGTAAAACAGACACAGAGAAGGGGCTCAATGCTTGGCGTCGCCTTCCCTGTTGCAAGTCCCGCATCGCCACGAATAATCGAGAGCTTGGATAACGGAAGATGATTACCGAGGCAGTCAACCTCGATCCGTCATCGTCGACACATTTCCCCGGAAGCGACCGTATATTGATCCCGCCTGTGGGTACAAGGCGGGTATGCAGTCCCCGATCTCTGCCAGAAGATGCTTGCCGTCCGTCCGCCACTATATTCCCGGTACCCGTGCTGGTACAACCCCCTGATTTCTACAGCCACATGGACAGTTTGACCTAGAAGCTGGAGCCGTTAAGCGGCTCGACCTTCTATTCGAACGCGTGCTGTACGGGAATTGTATATCTCTGAGGTGACGCAGTCCAGTTCGGTTCGTCAGTTTGCGGCTCATGTCTCCCGCAGGCGACTGGTTGCTTCAAGGACGTACGAGTTATAATATCGC-CTCCAGCTTTTGGAGATTCTTGAACCCCGCTGCGGGTAGATTTTCCGACACGCTCCCCGACGCGAGTACGGCAAGCGATGCGCTCATCCGTCTGACACTGCCAGCCCCCTCCCGTCGG--TATAAATGCTAGGTTAAGAGACCAAGATACAAGATGGACCTGGGAACACTGCTCTTCCCCGCCTAATTTTCGCCCTAATCTTGTTTTCTGGTCCTGCCCGGTTACTATACTATCCGTTGGCTTATGTCGCGCATTGCATAACGATCTCACAGCTACCCCGACGAGTCGAACGGCGTTCACTTGCTTGTATCCCCGCATCGCGTTGAAAAGCTGCCCCCGGCGCCTATCCCGAGTCGTAGCCAACAGGCGTACGCAACCGGATACTTTCTAGTGTGGATCGTACGCAACCGGTCACGCTCGGTCCTAACCGCCTCAGTCATTCTCATAGCGCTCGTGCCTCCGTTCTTCTGCATTCCTCATCGGATCCCGGTGAAAGTCCCAATCCGCATTCTCTCGGTCCGCCGAGATCTCTGGCTGGCCGGGACGTCCGGGGGGAGCCCCATGTCGGCCGCCTAGGGGCCGGTCCGTGCCAACTCGGCTAGTCCTCGCGCAATCCAAGGCAGGCGCACACTACAACGCGGGCGCCCGCCGCTGTCCCTGCGTTGTCGGGCGGCCGTAGGTCGGGGTGGACGAAAGCGGCTACCCCCACGCGATCGTGCTGGTGCGAGTCGGCTTCTCGATAGTAGGTCCGGCGTACGGCGGACAAAAGGGGGGGAACTTCCCCACGGAGCGGGGGAGGCGAGGGGTGGCGTGCGTGGGGTATTCGTCGGGAGAGCGGGGGGCACGCGGAGGGACGCGAGCGAAGGCAGAATGCCCCAGCCCCCGAGGGCGGACTGGCGAAGGGGAGCCCCGCACA
GTTGCAATTTTCGAGCTCTCGATATACGGTAGATCTTTCGAAAAGGGAAAAATTGGTCTCGCAAACCCTTTAATATATAAGAAAAGACATAAATAAACAAACGAACTCGCAAAC
3 CGGGACTGCGTTCTTGCAGTACAAGGAAGGACGGGCACAACCGGTGCTGCCAATATCACAGGTCTCCTTTAGATC--GGCACAACCGGAACGGAACAGAGCAGCCTACGACAGCTAATAAGTGTGGTGTCACATCCGGTTGTGCCTATGTCGTTGGGCTAGATGGTCCACCTATGGAGATCCCGCAAAAATTTTGGGACCCCGCTCGGTGAAACTTTCTTCAAACGGGACTACGGTGACGGCGCGGGTCGATAAATGCGCTATTCCGTTTTCTCCTCTCAGCGTCATTATCTTTCCACCGATTTGCGTACGGGGATGAGCCATTGACGCGCGTCTTTGCGTACACAAGTAGAAGGTACCAACTTAGCTAGAATAACCGAAATGCACAAGTCTGGCTGGTAGTACCACCTCACTTTCCCCCATATCTCGATTTACTTCAATCGCCTCCGCCTTTGGATTCCCCTGTGCTATCTTACAGGAGACGGGTATCTGAAGGGATGCGGCTCGGCTTTCCCTGTGCCCGGGACCTAGCTCTGGCGGAACTCTTTCGTTGCTTGGGTACCCTGCACACTGACTTCC-CACGCTGAGCGATAGTAATGGTTAGACGGGGAAGGTAAAACAGACACAGAGAAGGGGCTCAACGCTTAGCGTCGCCTTCCCTGTTGCAAGTCCCGCATAGCCACGAATAATCGAGAGCTTGGATAACGGAAGATGATTACCGAGGCAGTCAACCTCGATCCGTCATCGTCGACACATTTCCCCGGAAGCGACCACAAGTTGATCCCGCCTGTGGGTACAAGGCGGGTATGCAGTCCCCTATCTATGCCAGAAGATGTTTGCCGTCCGTCCGCCACTATATTCCCGGCTCCCATGCTGGTACAACCCCCTGATTTCTACAGCCACATGGACAGTTTGACCTAGAAGCTGGAGCCGTTAAGCGGCTCGACCTTCTATTCGAACGCGTGCTGTACGGGAATTGTATATCTCTGAGGTGACGCAGTCCAGTTCGGTTCGTCAGTTTGCGGCTCATGTCTCCCGCAGGCGACTGGTTGCTTCAAGGACGTACGAGTTATAATACCGC-TTCCAGCGTTTGAAGTTCCTTGATCCCCGCCGCGGGTCTATTTTCCGACACGCTCCCCGACGCGAGTACGGCAAGCGATGCGCTCATCCGTCTGACACTGCCAGCCCCCCCCCGTCGGT-TATAGGTGCTAGGTTAAGAGACCAAGATACAAGATGGACCTGGGAACACTGCCCTTCCCCGCCTAATTTTCGCCCTAATCCTGTCTTCTGGTTCTGCCCAGTTACTATAC-ACCCGTTGGTTTATGT-GCGCGCTGCATAACGATCTCACAGCTACCCCGACGAGTCGAACGGCGTTCACTTGCTTGTATTCCCGCATAGCGTTGAAAAGCTG-TTCTGGCGCCTATCCCGAGTCGTAGCCGACA-GCGGGCACAACCGGATACTTTCTAGTGTGGATCGGGCACAACCGGTCACGCTCGGTCCCAACCGCCTCAGTCATTCCCATAGCGCTCGCGCCTCCGCTCTTCTGCATTCCTCATCGGATCCTGGCGAATGTCCCAGCCCGCGCCCGCCCGGTCCGCCGAAGTTTCTGGAGAGCCGGGACGTCCGGGGGAAGCCCCATGTTGGCCGCCTAGGGGCCGGTCCGTACTAGCTGGGCTAGTCCTCGCGCGGCGCGAGGCCGTCGCACACCACGACGCGGGCTCCCACCGCGGCCCCTGCGTTGTCGGGCGGCCGTGGGGCGGGGCGGGCGCACGCAGCTACCCCCACGCGATCGTGCTGGTGCGAGTCGGCTCCTCGGTAGTAGATCCGGCGTACAGTGAAAAAGACGGGAGGCACTTCTCCACGAAACAGGGAAGACGGGGAGTAGCATGTGCAGGGCGTTTATCGGGGGAGCGGGGGGCGCGCGAGAGGGCGCGAGCGATGGCAGAATGCTCCGGCCCCCGGGGGCGGGCTGTCGTAGGGGAGCCTCACGCG
GTTGCAATTTTCGAGCTCTCGATATATGGTCTATCTTTTCATAAGGGAAAAATTGGGTTTATGGCTCCTTTAATATATAAAAAAAGACATAAATAAACAAACGAACTCGCAAAC
4 CGGGACTGCGTTCTTGCAGTACAAGGAAGGACGGGCACAACCGGTGCTGCCAATATCACAGGTCTCCTTTAGATCGTGGCACAACCGGAACGGAACAGAGCAGCCTACGACAGCTAATAAGTGTGGTGTCACATCCGGTTGTGCCTATGTCGTTGGGCTAGATGGTCCACCTATGGAGATCCCGCAAAAATTTTGGGACCCCGCTCGGTGAAACTTTCTTCAAACGGGACTACGGTGACGGCGCGGGTCGATAAATGCGCTATTCCGTTTTCTCCTCTCAGCGTCATTATCTTTCCACCGATTTGCGTACGGGGATGAGCCATTGACGCGCGTCTTTGCGTACACAAGTAGAAGGTACCAACTTAGCTAGAATAACCGAAATGCACAAGTCTGGCTGGTAGTACCACCTCACTTTCCCCCATATCTCGATTTACTTCAATCGCCTCCGCCTTTGGATTCCCCTGTGCTATCTTACAGGAGACGGGTATCTGAAGGGATGCGGCTCGGCTTTCCCTGTGCCCGGGACCTAGCTCTGGCGGAACTCTTTCGTTGCTTGGGTACCCTGCACACTGACTTGTCCACGCTGAGCGATAGTAATGGTTAGACGGGGAAGGTAAAACAGACACAGAGAAGGGGCTCAACGCTTAGCGTCGCCTTCCCTGTTGCAAGTCCCGCATAGCCACGAATAATCGAGAGCTTGGATAACGGAAGATGATTACCGAGGCAGTCAACCTCGATCCGTCATCGTCGACACATTTCCCCGGAAGCGACCACAAGTTGATCC--CCTGTGGGTACAAGGCGGGTATGCAGTCCCCTATCTATGCCAGAAGATGTTTGCCGTCCGTCCGCCACTATATTCCCGGCTCCCATGCTGGTACAACCCCCTGATTTCTACAGCCACAT-GACAGTTTGACCTAGAAGCTGGAGCCGTTAAGCGGCTCGACCTTCTATTCGAACGCGTGCTGTACGGGAATTGTATATCTCTGAGGTGACGCAGTCCAGTTCGGTTCGTCAGTTTGCGGCTCATGTCTCCCGCAGGCGACTGGTTGCTTCAAGGACGTACGAGTTATAATACCGC--TCCAGCGTTTGAAGTTCCTTGATCCCCGCCGCGGGTCTATTTTCCGACACGCTCCCCGACGCGAGTACGGCAAGCGATGCGCTCATCCGTCTGACACTGCCAGCCCCCCCCCGTCGGTGTATAAGTACTAGGTTAAGAGACCAAGATACAAGATGGACCTGGGAACACTGCCCTTCCCCGCCTAATTTTCGCCCTAATCCTGTCTTCTGGTTCTGCCCAGTTACTATACTACTTGTTGGCTTATGTCGCGCGCTGCATAACGATCTCACAGCTACCCCGACGAGTCGAACGGCGTTCACTTGCTTGTATTCCCGCATAGCGTTGAAAAGCTG-TTCTGGCGCCTATCCCGAGTCGTAGCCAA---GCGGGCACAACCGGATACTTTCTAGTGTGGATCGGGCACAACCGGTCACGCTCGGTCCCAACCGCCTCAGTCATTCCCATAGCGCTCGCGCCTCCGCTCTTCTGCATTCCTCATCGGATCCTGGCGAATGTCCCAGCCCGCGCCCGCCCGGTCCGCCGAAGTTTCTGGAGAGCCGGGACGTCCGGGGGAAGCCCCATGTTGGCCGCCTAGGGGCCGGTCCGTACTAGCTGGGCTAGTCCTCGCGCGGCGCGAGGCCGTCGCACACCACGACGCGGGCTCCCACCGCGGCCCCTGCGTTGTCGGGCGGCCGTGGGGCGGGGCGGGCGCACGCAGCTACCCCCACGCGATCGTGCTGGTGCGAGTCGGCTCCTCGGTAGTAGATCCGGCGTACAGTGAAAAAGACGGGAGGCACTTCTCCACGAAACAGGGAAGACGGGGAGTAGCATGTGCAGGGCGTTTATCGGGGGAGCGGGGGGCGCGCGAGAGGGCGCGAGCGATGGCAGAATGCTCCGGCCCCCGGGGGCGGGCTGTCGTAGGGGAGCCTCACGCG
GTTGCAATTTTCGAGCTCTCGATATATGGTCTATCTTTTCATAAGGGAAAAATTGGGTTTATGGCTCCTTTAATATATAAAAAAAGACATAAATAAACAAACGAACTCGCAAAC
5 CGGGACTGCGTTCTTGCAGTACAAGGAAGGACGGGCGCAGCCGGAGCTGCCAATATCACAGGTCTCCTTTAGATC--GGCGCAGCCGGAACGGAACAGAGCAGCCTACGACAGCTAATAAGTGTGGTGCAACATCCGGCTGCGCCTATGTCGCTGGGCTAGAAGGTCCACCTATGGAGATCCCGCAAAAATTTTGGGACCCCGCTCGGTGAAACTTTCTTCAAACGGGACTACGGTGACGGCGCGGGTCGATACATGCGCTATTCCGTTTTCTCCTCTCAGCGTCATTATCTTCCCACCGATTTGCGTACGGGGATGAGCCGTCGACGCGCGTCTCTGCGTACACAAGTCGAAGGTACCAACTTAGCTAGAATAACCGAAATGCACAAGTCTGGCTGGTAGTACCACCTCACTTTCCCCTATATCTCTATTTACTTCAATCGCCTCCGCCTTTGGATTCCCCTGTGCCGTTTTACAGGAGACGGGTATCTGAAGGGATGCGGCTCGGCTTTCCCTGTGCCCGGGACCTAGCTCTGGCGGAACGCTTCCGTTGCTTGGGTACCCTGCACACTGACTTCT-CACGCTGAGCGATAGTAATGGTTAGACGGGGAAGGTAAAACAGACACAGAGAAGGGGCTCAATGCTTGGCGTCGCCTTCCCTGTTGCAAGTCGCGGATAGTCACGAATAATCGAGAGCTTGGATAACGGAAGATGGTTACCGAGGCAGTCAACCTCGATCCGTCATCGTCGACACATTTCCCCGGAAGCGACCGCATGTTGATCCCGCCTGTGGGTACAAGGCGGGTATGCAGTCCCCGATCTATGCCAGAAGATGTTTGCCGTCCGTCCGCCACTATATTCCCGGCTCCCATGCTGGTACAACCCCCTGATTTCTACAGCCACATGGACAGTTTGACCTAGAAGCTGGAGCCGTTAAGCGGCTCGACCTTCTATTCGAACGCGTGCTGTACGGGAATTGTATATCTCTGAGGTGACGCAGTCCAGTTCGGTTCGTCAGTTTGTGGCTCATGTCTCCCGCAGGCGACTGGTTGCTTCAAGGACGTACGAGTTATAATATCGCGCTCCAGCGTTTGGAGATTATCGTACCCCGCCGCGGGTCTATTATCCGACACGCCCCCCGGCGCGAGTACGGCAAGCGTTGCGCTCATCCGTCTGACACTGCCAACCCCCCTCCGTCGGT-TATAAATGCTAGGTTAAGAGACCAAGATACAAGATGGATCTGGGAACACTGCCCTTCCCCGCCTAATTTTCGCCCTAATCTTGTTTTCTGGTCCTGCCCGGCTACGATCC-ACCCGCTGGCCTATGT-GCGCATTGCATAACGATCTCATAGCTACCCCGACGAGCCGAACGGCGTTCACTTGCTTGTATCCGCGGATAGTGTTGAAAGGCTGCCCCTAGCGCCTATCCCGAGTCGTAGCCAACA-GCGGGCGCAGCCGGATACTTTCTAGTGTGGATCGGGCGCAGCCGGTCACGCTCGGTCTCAACCGCCTCAGTCATTCCGATAGCGCTCGCGCCTCCGTTCTTCTGCATTCCTCATCGGATCCCGGTGAAAGTCCCAATCCGCATTCTCTCGGTCCGCCGAGATCTCTGGCTGGCCGGGACGTCCGGGGGGAGCCCCATGTCGGCCGCCTAGGGGCCGGTCCGTGCCAACTCGGCTAGTCCTCGCGCAATCCAAGGCAGGCGCACACTACAACGCGGGCGCCCGCCGCTGTCCCTGCGTTGTCGGGCGGCCGTAGGTCGGGGTGGACGAAAGCGGCTACCCCCACGCGATCGTGCTGGTGCGAGTCGGCTTCTCGATAGTAGGTCCGGCGTACGGCGGACAAAAGGGGGGGAACTTCCCCACGGAGCGGGGGAGGCGAGGGGTGGCGTGCGTGGGGTATTCGTCGGGAGAGCGGGGGGCACGCGGGGGGACGCGAGCAAAGGCAGAATGCCCCAGCCCCCGAGGGCGGGCTGGCGAAGGGGAGCCCCGCACA
GTTGCAATTATTGAGCTCTCGATATACGGTCTATCTATTGAAAAGGGAAAAATTGGTCCCGCAAACCCTTTAATATATAAGAAAAGACATAAATAAACAAACGAACTCGCAAAC
6 CGGGACTGCGTTCTTGCAGTACAAGGAAGGACGGGCGCAACCGGTGCTGTCAATATCACAGGTCTCCTTTAGATCGTTACGCGACCGGAACAGAACAGAGCAGCCTACGACAGCTAATAAGTGTGGTGCCACATCCGGTCGCGTATATGTCGCTGGGCTAGATGGTCCACCTATGGAGATCCCGCAAAAATTTTGGGACCCCGCTCGGTGAAACTTTCTTCAAACGGGACTACGGTGACGGCGCGGGTCGATAAATGCGCTATTCCGTTTTCTCCTCTCAGCGTCATTATCTTTCCACTGATTTGCGCCCGGGGATGAGCCATCGACGCGCGTCTTCGCGTACACAAGTCGAAGGTAACAACTTAGCTAGAATAACCGAAATGCACAAGTCTGGCTGGTAGTACCACCTCACTTTCCCCTATATCTCTATTTACTTCAATCGCCTCCGCCTTTGGATTCCCCTGTGCTATCTTACAGGAGACGGGTATCTGAAGGGATGCGGCTCGGTTTTGCCTGTGCCCGGGACCTAGTTCTGGCGGAACGCTTTCGTTGCTTGGGTACCCTGCACACTGACTTCC-CACGCTGAGCGATAGTAATGGTTAGACGGGGAAGGTAAAACAGACACAGAGAAGGGGCTCAATGCTTGGCGTCGCCTTCCCTGTTGCAAGTCGCGGATAGTCACGAATAATCGAGAGCTTGGATAACGAAAGATGGTTACCGAGGCAGTCAACCTCGATCCGTCATCGTCGACACATTTC--CGGAAGCGACCACATGTTGATCCCGACTGTGGGTACAAGGCGGGTATGCAGTCCCCGATCTATGCCAGAAGATGTTTGCCGTCCGTCTGCCACTATATTCCCGGCTCCCATGCTGGTACAACCCCCTGATTTCTACAGCCACATGGACAGTTTGACCTAGAAACTGGAGCCGTTAAGCGGCTCGACCTTCTATTCGAACGCGTGCTGTACGGGAATTGCATATCTCTGAGGTGACGCAGTCCAGTTCGGTTCGTCAGTTTGTGGCTCATGTCTCCCGCAGGCGACTGGTTGCTTCAAGGACGTACGAGTTATAATATCGC-CTACAGCGTTTGAAGTTTCCTGAACCCCGCCGCGGGTCTATTTTCAGACACGCTCCCCGACGCGAGTACGGCAAGCGATGCGCTCATCCGTCTGACGCTGCCAACCCCCCCCCGTCGG--TATAAATGCTAGGTTAAGAGACCAAGATACAAGATGGACCTGGGAACACTGCTCCTCCCCGCCTAATTTTCGCCCTAATCCTGTCTTCTGGTCCTGCCCAGTTATTATAC-ACCCGTTGGCTTATGTCGCGCATTGCATAACGATCTCACAGCTACCCCGACGAGCCGAACGGCGTTCACTTGCTTGTATTCGCGGATAGTGTTGAAAAGCTACCCCTGGCGCCTATCCCGAGTCGTAGCCAATAGGCGGGCGCAACCGGATACTTTCTAGTGCGGATTGTACGCGACCGGTCACGCTCGGTCCCAACCGCCTCAGTCATTCTCATAGCGCTCGTGCCTCCGTTCTTCTGCATTCCTCATCGGATCCTGGTGAATGTCCCAATCCGCATTCTCTCGGTCCGCCGAGATCTCTGGCTGGCCGGGACGTCCGGGGGGAGCCCCATGTCGGCCGCCTAGGGGCCGGTCCGTGCCAATGCGACAAATCCCTGCGCGATCCAGGGCCGGCGCACACCACAACGCGGTCTCCCGCCGCGGCCCCCGTACTGCCGAGCGGCTGGGGGTCGTGGTGGACGAACGCGGCTGCCCCCACGAGGCTTTACTGGTGCGAGTCGGCCTCCCGATAGAAGGTCTGGCGTACAGTGAAAAAGAGAGGGGGAGCTTCTCCACGAAGCAGGGGAGAAGGGGAGTAGCGTGTGTAGGGTACTCGTCGGGAGAGCGGGGGGCGCGCGGGGGGACGCGAGAGGTGGCAGAATGCTCCGGCCCCCGGGGGCGGGCTAGCGAAAGGGAGCCTCGCGCG
GTTGCAATTTTCGAGCTCTCGATATACGGTCTATCTTTCGAAAAGGGAAAAATTGGGTTTATGGCTCCTTTAATATATAAAAAAAGACATAAATAAACAAACGAACTCGCAAAC
7 CGGGACTGCGTTCTTGCAGTACAAGGAAGGACGTACGCAACCGGTGCTGTCAATATCACAGGTCTCCTTTAGATCGTTACGCAACCGGAACAGAACAGAGCAGCCTACGACAGCTAATAAGTGTGGTGCCACATCCGGTTGCGTATGCATCGCTGGGCTAGATGGTTTACCTATGGAGATCCCGCAAAAATTTTGGGACCCCGCTCGGTGAAACTTTCTTCAAACGGGACTACGGTGACGGCGCGGATCGATACATGCGCTATTCCGTTTTCTCCTCTCAGCGTCATTATCTTTCCACCGATTTGCGTACGGGGATGAGCCATCGACGCGCGTCTTTGCGTACACTGAAAGAAGGTACCAACTTAGCTAGAATAACCGAAATGCACAAGTCTGGCTGGTAGTACCACCTCACTTTCCCCCATATCTCGATTTACTTCAATCGCCTCCGCCTTTGAATTCCCCTGTGCCGTCTTACAGGAGACGGGTATCTGAAGGGATGCGGCTCGACTTTGCCTGTGCCCGGGACCTAGTTCTGGCGGAACGCTTCCGTTGCTTGGGTACCCTGCACACTGACTTGT-CACGCTGAGCGATAGTAATGGTTAGACGGGGAAGGTAAAACAGACACAGAGAAGGGGCTCAATGCTTGGCGTTGCCTTCCCTGTTGCAAGTCCCGCATCGCCACGAATAATCGAGAGCTTGGATAACGGAAGATGGTTACCGAGGCAGTCAACCTCGATCCGTCATCGTCGACACATTTCCCCGGAAGCGACCGTATGTTGATCCCGCCTGTGGGTACAAGGCGGGTATGCAGTCCCCTATCTCTGCTAGAAGATGTTTGCCGTCCGTCTGCCACTATATTCCCGGCTCCCATGCTGGTACAACCCCCTGATTTCTACAGCCACAT-GACAGTTTGACCTAGAAGCTGGAGCCGTTAAGCGGCTCGACCTTCTATTCGAACGCGTGCTGTACGGGAATTGTATATCTCTGAGGTGACGCAGTCCAGTTCGGTTCGTCAGTTTGCGGCTCATGTCTCCCGCAGGCGACTGGTTGCTTCAAGGACGTACGAGTTATAATATCGC-CTCCAGCGTCTGGAGATTATCGTACCCCGCCGCGGGTCTATTATCCGACACGCTCCCCGACGCGAGTACGGCAAGCGATGCGCTCATCCGTCTGACACTGCCAGCCCCCTCCCGTCGG--TTTAGATGCTAGGTTAAGAGACCAAGATACAAGATGGACCTGGGAACACTGCCCTTCCCCGCCTAATTTTCGCCCTAATCCTGTCTTCTGGTCCTGCCCAGTTACTATACTATCCGTTGGCTTATGTCATGCATTGCATAACGATCTCATAGCTACCCCGACGAGCCGAACGGCGTTCACTTGCTTGTATTCCCGCATCGCGTTGAAAAGCTGCCCCTAGCGCCTATCCCGAGTCGTAGCCAATAAGCGTACGCAACCGGATACTTTCTAGTGCGGATCGTACGCAACCGGCCGCACGTATCCCTAACCGCCTCAGTCATTCCCATAGCGCTCGCGCCTCCGTTCTTCTGCATTCCTCATCGGATCCTGGTGAATGCCCCAGCCCGCGTTCTCTCGGTCCGCCGAGGTCTCTGGCGGGCCCGGACGTCCGGGGGAAACTCCATGTCGGCCGCCTAGGGGCCGGTCCGTGCCAGCTCGGCTAGTCCCTGCGCGGCGTGAGGTAGGCGCACACCACGACGCGGGCTCCCACCGCGGCCCCCGCGCTGCCGAGCGGCTGGGGGTCGGGGTGGACGAACGCGGCCGCCCCCACGCGATCTTGCTAGTAAGAGTCGGCCTATCGATAGTAGACCTAGCGTACAGTGAAAAAGAGAGGGGGAGCTTCTCCACGAAGCAGGGAAGGCGAGGAGTAGCG

Memory Requirements

Hi,
I am trying to run Panseq on 7 fungal genomes of 30 Mb on an HPC cluster. The job was allocated 80 GB of memory and failed when it hit the memory limit. I can't work out how much memory is required for 7 genomes. Please advise.

SNP positions

Hi Chad,

I was interested in extracting a full core genome alignment from panseq by taking the core genome fragments and the core_snps.txt output to generate a multiple alignment for each gene, with each allele.

Panseq was run to generate 500 bp fragments (and all the core genome fragments are 500 bp), so why does core_snps.txt report SNPs at positions beyond 500?

Thanks in advance for any help with this!!

Best regards,

Matt

unexpected char in string error

Hi Chad,

I'm trying to run Panseq on some publicly available genomes, and it ran successfully on the genomes from one subspecies. As soon as I included two other subspecies, I got an "unexpected char in string" error. Weirdly, this error comes up for strains that were processed successfully in the first run. Those characters do not exist in the input, so I'm assuming they are in a temp file that the program writes and then refers back to?

Below is an example from the Master log file (the top and bottom).

2019/12/10 14:29:27 INFO |  NovelIterator.pm:186> We have 74 genomes this run 
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
Unexpected character `7' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome
Unexpected character `4' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome
Unexpected character `4' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome


Unexpected character `.' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome_(1138930..1144388)
2019/12/10 14:30:00 WARN |  CombineFilesIntoSingleFile.pm:83> Skipping /PATH/vpd/syphilis/panseq/run2-all-strains/6665952b07be10cc3db02af26d6d6f3a_5616179e4a49b14a8e4caa454f9b6f58_NR as it has size of 0 
2019/12/10 14:30:00 INFO |  Panseq.pm:268> Panseq mode set as pan 
2019/12/10 14:30:00 INFO |  SegmentMaker.pm:164> Segmenting /PATH/vpd/syphilis/panseq/run2-all-strains/6665952b07be10cc3db02af26d6d6f3a_5616179e4a49b14a8e4caa454f9b6f58 into 500bp segments 

If I remove the isolate from the analysis I get even more of these errors, for several other isolates. Any insight would be awesome.

Thanks!
-Christy

show-coords called incorrectly in NucmerRun.pm

The show-coords command of MUMmer is called incorrectly in NucmerRun.pm (at least with respect to the version of MUMmer I'm running). The options and the delta file input argument are in reversed order. Line 219 in NucmerRun.pm should be changed to

my $coordsLine = $self->mummerDirectory . 'show-coords -l -q -T ' . $deltaFile . ' > ' . $self->coordsFile;

This bug at a minimum caused some tests in t/output.t to fail. I tested with Panseq commit ab3704d, MUMmer version 3.23, MUSCLE version 3.8.31, and BLAST version 2.2.28+. After the above modification all tests passed.

Minimum length of scaffolds?

Hello,

I was wondering if there is a minimum length requirement for scaffolds somewhere hard-coded in the program?

I performed a test where I compared the length of genomes before running Panseq on them and after (by summing up all of the fragments each genome had in the Pangenome, as well as the length of the fragments). For the most part these numbers agreed very well (1-2% difference), but with highly fragmented genomes (with a lot of small pieces) there was as much as a 13% reduction in genome size.

The smallest scaffolds I used as input are 1kb, but I was wondering if I should make that even higher?

Thank you for the wonderful tool,
-Matt
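
The before/after comparison described in this question can be sketched as follows. This is a minimal sketch, not part of Panseq: the function names and the toy sequences are hypothetical, and it assumes plain FASTA input where every line that does not start with ">" is sequence.

```python
from io import StringIO


def fasta_total_length(handle):
    """Sum the lengths of all sequence lines in a FASTA stream."""
    total = 0
    for line in handle:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def percent_reduction(input_bp, recovered_bp):
    """Percentage of input bases that are not recovered in the output."""
    if input_bp <= 0:
        raise ValueError("input length must be positive")
    return 100.0 * (input_bp - recovered_bp) / input_bp


# Toy data standing in for one input assembly and its pan-genome fragments.
genome = StringIO(">scaffold_1\nACGTACGTAC\n>scaffold_2\nACGTA\n")
fragments = StringIO(">frag_1\nACGTACGTAC\n>frag_2\nACG\n")

input_bp = fasta_total_length(genome)
recovered_bp = fasta_total_length(fragments)
reduction = round(percent_reduction(input_bp, recovered_bp), 1)
print(input_bp, recovered_bp, reduction)
```

Running the same calculation per genome on real files would show whether the loss is concentrated in the assemblies with many short scaffolds.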

test fails for 6 genomes

I am not sure if this is the best place to post but I did not find any mailing list, etc. to ask this question.

I installed Panseq and all its prerequisites. When I try to run the test, it fails for 6 genomes:

Settings file: t/genomes.batch
2016/06/09 09:15:51 INFO | NovelIterator.pm:186> We have 6 genomes this run
2016/06/09 09:16:26 INFO | Panseq.pm:253> Panseq mode set as pan
2016/06/09 09:16:26 INFO | SegmentMaker.pm:164> Segmenting t/genomes/5cab9e25afd603ac843befcc870cc3be_9cfa6d9cba95e7e695ff73dc8e834bfe into 1000bp segments
2016/06/09 09:16:28 INFO | FastaFileSplitter.pm:127> Splitting t/genomes/pangenome_fragments.fasta into 1 files
2016/06/09 09:16:29 INFO | PanGenome.pm:208> Analyzing the pan-genome
2016/06/09 09:16:29 INFO | PanGenome.pm:213> Processing Blast output files
2016/06/09 09:16:29 FATAL | BlastResults.pm:62> No such file or directory
2016/06/09 09:16:30 WARN | CombineFilesIntoSingleFile.pm:82> Skipping t/genomes/1_coreGenomeFragments as it has size of 0
2016/06/09 09:16:30 WARN | CombineFilesIntoSingleFile.pm:82> Skipping t/genomes/1_accessoryGenomeFragments as it has size of 0
2016/06/09 09:16:30 WARN | CombineFilesIntoSingleFile.pm:82> Skipping t/genomes/1_locus_alleles as it has size of 0
2016/06/09 09:16:30 INFO | PanGenome.pm:301> Processing blast output files complete
2016/06/09 09:16:30 INFO | PanGenome.pm:302> Pan-genome generation complete
2016/06/09 09:16:30 INFO | Panseq.pm:458> Creating zip file
✖ plasmidsCoreSnps generated correctly
Failed test 'plasmidsCoreSnps generated correctly'
at t/output.t line 154.
got: '74159081ba92c9866cab796d6180f3d9'
expected: 'a7d2902d80543446a6701e8f8770301b'
✖ plasmidsPanGenome generated correctly
Failed test 'plasmidsPanGenome generated correctly'
at t/output.t line 155.
got: '74159081ba92c9866cab796d6180f3d9'
expected: '168d75b59dbe825cd91222906a4f5645'

and many lines similar to the above

I am using Perl 5.22.1 on CentOS 6.6, with MUMmer 3.23, MUSCLE 3.8.31, and BLAST+ 2.3.0.

Any advice will be greatly appreciated.
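
Both failed tests above report the same "got" value, which is what one would see if identical bytes were hashed in both cases. The 32-character values look like MD5 digests; that is an assumption, not something the log states. Under that assumption, a minimal sketch for checking by hand whether two generated files are byte-identical (the file names below are stand-ins created in a temporary directory):

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical stand-ins for two generated output files.
workdir = tempfile.mkdtemp()
first = os.path.join(workdir, "core_snps.txt")
second = os.path.join(workdir, "pan_genome.txt")
for path in (first, second):
    with open(path, "w") as handle:
        handle.write("same content\n")

same_digest = md5_of_file(first) == md5_of_file(second)
print(same_digest)
```

If two different outputs hash to the same value, it is worth opening them to see what they actually contain before comparing against the expected digests.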

compilation error of t/output.t

perl t/output.t
1..19
Global symbol "$removeRun" requires explicit package name at t/output.t line 13.
Global symbol "$removeRun" requires explicit package name at t/output.t line 137.
Execution of t/output.t aborted due to compilation errors.

Bad plan: 0 != 19

Core genome size and alignment

Hi Chad,

The snp.phylip file contains the variant sites of all 'core' fragments. Is it possible to extract the full alignment of the fragments as a multi-FASTA file per genome?

Thanks in advance for any comments on this!

Get Muscle alignment fasta files from pangenome fragments

Dear Chad,

Thank you so much for your great tool!
I have a question concerning the Muscle alignment that is made for the fragments that are present in more genomes than the specified minimum threshold.
I would like to use the Muscle alignment output files of the fragments for further analysis, specifically I would like to view the alignment in an alignment viewer like IGV or Jalview and then color the loci identified by the loci selector to see in what regions they are located. I would like to use this information to design suitable amplicons for further analysis.
However, I could not figure out how to get the Muscle alignment from Panseq. Is it possible to directly get it from the Panseq output or do I have to do the Muscle alignment separately myself using the locus_alleles.txt fasta files?

I am using Panseq with perl-5.31.2, muscle 3.8.31, ncbi-blast-2.9.0+ and MUMmer 3.23.

I thank you very much for your help in advance and I am looking forward to using your tool for my research!

Best regards,
Luzia

An error was encountered during analyses

Hi!

I was using Panseq to analyze a group of genomes, but I ran into the following problem:

An error was encountered during analyses.
If you uploaded your own data, please ensure it is fasta formatted, and if more than one genome is present in a file, please ensure that the fasta headers follow the format in the FAQ. Please try your analyses again, and if you continue to experience problems, send an email to chad.laing at the canada.ca domain. Thank you for using Panseq.

I have several singular files like this:

lcl|s201|seqone
agcaaccaatctaatcacaagtaactgtttttcaacaagtttctatctgcatcaccgccg
atggatgtgcattatagaccatctcattccctttgcaaggggtatttatccctttttcac
ttgagtgccgtttttttctactattttgcgaaaa...............

lcl|s202|seqone
agcaaccaatctaatcacaagtaactgtttttcaacaagtttctatctgcatcaccgccg
atggatgtgcattatagaccatctcattccctttgcaaggggtatttatccctttttcac
ttgagtgccgtttttttctactattttgcgaaaa.......................

lcl|s203|seqone
agcaaccaatctaatcacaagtaactgtttttcaacaagtttctatctgcatcaccgccg
atggatgtgcattatagaccatctcattccctttgcaaggggtatttatccctttttcac
ttgagtgccgtttttttctactattttgcgaaaaaagtgata...................

Is there anything wrong with my file format? I can't find the problem.

thanks very much!
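
The error message asks for FASTA-formatted input, and FASTA records begin with a header line that starts with ">". The headers as they appear in this post have no leading ">", although that may only be a rendering artifact of the page. A minimal sketch of a pre-upload check follows; the function name and the allowed-character set (IUPAC nucleotide codes plus "-") are assumptions for illustration, not Panseq's own validation.

```python
# IUPAC nucleotide codes plus the gap character (assumed acceptable here).
ALLOWED = set("ACGTUNRYKMSWBDHV-")


def check_fasta(lines):
    """Return a list of (line_number, problem) tuples; empty means nothing found."""
    problems = []
    in_record = False
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            in_record = True
            continue
        if not in_record:
            problems.append((number, "text before the first '>' header"))
            in_record = True  # report the missing header only once
            continue
        bad = set(line.upper()) - ALLOWED
        if bad:
            problems.append((number, "unexpected characters: " + "".join(sorted(bad))))
    return problems


no_header = ["lcl|s201|seqone\n", "agcaaccaatctaatcacaag\n"]
with_header = [">lcl|s201|seqone\n", "agcaaccaatctaatcacaag\n"]
print(check_fasta(no_header))
print(check_fasta(with_header))
```

The same check flags literal dots left in a sequence line, so elided sequences pasted as "...." would also be reported.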

the result

Why does the resulting panGenome.fasta contain so many sequences of "NNNN......."?

problem with installing Panseq

Heia,
I am trying to install Panseq on my mac.

When I run perl Build.pl I end up with this:
Checking prerequisites...
requires:
! Bio::DB::Fasta is not installed
! Bio::Seq is not installed
! Bio::SeqIO is not installed

ERRORS/WARNINGS FOUND IN PREREQUISITES. You may wish to install the versions
of the modules indicated above before proceeding with this installation

Run 'Build installdeps' to install missing prerequisites.

Could not get valid metadata. Error is: ERROR: Missing required field 'dist_version' for metafile

Could not create MYMETA files
Creating new 'Build' script for 'Panseq' version ''

I then run
./Build installdeps

But it does not install the missing modules. Could you tell me how I can get those modules?

Thanks for any help with this,

Camilla

Cannot open settings.txt

I created a tab-delimited configuration file named settings.txt and placed it in the 'lib' directory. When I run perl lib/panseq.pl settings.txt, I get the following error:
No such file or directory at /home/rks/Downloads/Panseq-master/lib/Modules/Setup/Settings.pm line 284.
What's the issue?

My configuration file "settings.txt" has the following contents:
queryDirectory /home/rks/Downloads/Panseq-master/16_genome_panseq/
baseDirectory /home/rks/Downloads/Panseq-master/output_panseq_P_pel/
numberOfCores 4
minimumNovelRegionSize 500
novelRegionFinderMode no_duplicates
fragmentationSize 500
percentIdentityCutoff 85
coreGenomeThreshold 3
runMode pan
cdhitDirectory /home/rks/Downloads/cdhit-master/
storeAlleles 1
allelesToKeep 2
nameOrId name
frameshift 1
overwrite 1
maxNumberResultsInMemory 500
blastWordSize 11
nucB 200
nucC 65
nucD 0.12
nucG 90
nucL 20
cdhit 1
sha1 1

Please let me know the solution for this issue
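
One thing worth checking with the command shown above is where each path actually points: a relative file name such as settings.txt is resolved against the current working directory, not against lib/. Whether that is the cause here is not certain. A minimal sketch, assuming the settings file is a list of key/value lines, that reports configured paths that do not exist (the helper names are hypothetical, and a temporary directory stands in for real locations):

```python
import os
import tempfile


def read_settings(lines):
    """Parse 'key<whitespace>value' lines into a dict, skipping blank lines."""
    settings = {}
    for raw in lines:
        parts = raw.strip().split(None, 1)
        if not parts:
            continue
        settings[parts[0]] = parts[1] if len(parts) == 2 else ""
    return settings


def missing_paths(settings, keys):
    """Return the keys whose configured path does not exist on disk."""
    return [key for key in keys
            if key in settings and not os.path.exists(settings[key])]


# Demo: one directory that exists and one that was never created.
existing = tempfile.mkdtemp()
absent = os.path.join(existing, "not_created")
example = [
    "queryDirectory\t" + existing + "\n",
    "cdhitDirectory\t" + absent + "\n",
    "numberOfCores\t4\n",
]
settings = read_settings(example)
print(missing_paths(settings, ["queryDirectory", "cdhitDirectory"]))

# A relative settings path is resolved against the current working directory.
print(os.path.abspath("settings.txt"))
```

Passing the full path to the settings file on the command line removes any dependence on the directory the command is run from.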

Format of assembly files

I am trying to run Panseq on 30 genome assemblies. Each isolate is a multi-FASTA file containing the scaffolds of that assembly (with unique names). When I run Panseq, it treats each scaffold as a genome and reports >4000 genomes instead of 30. Does the multi-FASTA file need to be formatted in a particular way for the program to understand that each file is one genome? Have I missed something?
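
A quick way to test the suspicion that every scaffold is being counted as a genome is to compare the number of files with the total number of FASTA records across them. A minimal sketch with hypothetical file names and toy sequences; it says nothing about how Panseq itself groups records.

```python
def count_records(fasta_text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


# Toy stand-ins for two assemblies, keyed by a hypothetical file name.
assemblies = {
    "isolate_A.fasta": ">scaffold_1\nACGT\n>scaffold_2\nGGCC\n",
    "isolate_B.fasta": ">scaffold_1\nTTAA\n",
}

per_file = {name: count_records(text) for name, text in assemblies.items()}
total_records = sum(per_file.values())
print(len(assemblies), "files,", total_records, "records")
```

If the total record count over the real input matches the genome count in the log, the headers are the place to look.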

Build.PL error: following files missing in your kit

Hi there,

I am having trouble getting Panseq to work. I reinstalled and tried to run perl Build.PL but get the following error.

WARNING: the following files are missing in your kit:
	.idea/.name
	.idea/compiler.xml
	.idea/copyright/profiles_settings.xml
	.idea/encodings.xml
	.idea/inspectionProfiles/Project_Default.xml
	.idea/jscsPlugin.xml
	.idea/jsLinters/jscs.xml
	.idea/jsLinters/jshint.xml
	.idea/jsLinters/jslint.xml
	.idea/misc.xml
	.idea/modules.xml
	.idea/perl5local.xml
	.idea/vcs.xml
	.idea/workspace.xml
	_Inline/.lock
	_Inline/config-x86_64-linux-gnu-thread-multi-5.024001
	_Inline/lib/auto/Bio/DB/IndexedBase_168b/IndexedBase_168b.inl
	_Inline/lib/auto/Bio/DB/IndexedBase_168b/IndexedBase_168b.so
	lib/Interface/html/output/FriFeb2416182920175586.html
	lib/Interface/html/output/FriFeb2416182920175586.html.zip
	lib/Interface/html/output/FriFeb2416184320175772.html
	lib/Interface/html/output/FriFeb2416184320175772.html.zip
	lib/Interface/html/output/FriMay2009131020168122.html
	lib/Interface/html/output/FriMay2009131020168122.html.zip
	lib/Interface/html/output/FriMay2009185220162483.html
	lib/Interface/html/output/FriMay2009225920165242.html
	lib/Interface/html/output/FriMay2009245520168418.html
	lib/Interface/html/output/FriMay2009245520168418.html.zip
	lib/Interface/html/output/FriMay2009253720166882.html
	lib/Interface/html/output/FriMay2009253720166882.html.zip
	lib/Interface/html/output/FriMay2009261520162048.html
	lib/Interface/html/output/FriMay2009261520162048.html.zip
	lib/Interface/html/output/FriMay2009264620167796.html
	lib/Interface/html/output/FriMay2009304320165893.html
	lib/Interface/html/output/FriMay2009304320165893.html.zip
	lib/Interface/html/output/FriMay2009470220162232.html
	lib/Interface/html/output/FriMay2009470220162232.html.zip
	lib/Interface/html/output/FriMay2016125320163557.html
	lib/Interface/html/output/FriMay2016125320163557.html.zip
	lib/Interface/html/output/FriMay2016134120162239.html
	lib/Interface/html/output/FriMay2016134120162239.html.zip
	lib/Interface/html/output/FriMay2016150320163847.html
	lib/Interface/html/output/FriMay2016150320163847.html.zip
	lib/Interface/html/output/FriMay2016163520162080.html
	lib/Interface/html/output/FriMay2016163520162080.html.zip
	lib/Interface/html/output/FriNov2512130520162945.html
	lib/Interface/html/output/FriNov2512144520167628.html
	lib/Interface/html/output/FriNov2512403320166878.html
	lib/Interface/html/output/FriNov2512460520166867.html
	lib/Interface/html/output/FriNov2512460520166867.html.zip
	lib/Interface/html/output/ThuMay1913534820164709.html
	lib/Interface/html/output/ThuMay1913534820164709.html.zip
	lib/Interface/html/output/ThuMay1914040420169910.html
	lib/Interface/html/output/ThuMay1914040420169910.html.zip
	lib/Interface/html/output/ThuMay1914090720167990.html
	lib/Interface/html/output/ThuMay1914090720167990.html.zip
	lib/Interface/html/output/ThuMay1915580220162532.html
	lib/Interface/html/output/ThuMay1916000020168524.html
	lib/Interface/html/output/ThuMay1916023420167873.html
	lib/Interface/html/output/ThuMay1916023420167873.html.zip
	lib/Interface/html/output/ThuMay1916043320165983.html
	lib/Interface/html/output/ThuMay1916043320165983.html.zip
	lib/Interface/html/output/ThuMay1916105520162478.html
	lib/Interface/html/output/ThuMay1916105520162478.html.zip
	lib/Interface/html/output/ThuMay1916144920169259.html
	lib/Interface/html/output/ThuMay1916144920169259.html.zip
	lib/Interface/html/output/ThuMay1916153320162671.html
	lib/Interface/html/output/ThuMay1916153320162671.html.zip
	lib/Interface/html/output/ThuMay1916161020161875.html
	lib/Interface/html/output/ThuMay1916161020161875.html.zip
	lib/Interface/html/output/ThuMay1916163420161189.html
	lib/Interface/html/output/ThuMay1916163420161189.html.zip
	lib/Interface/html/output/ThuMay1916170020164631.html
	lib/Interface/html/output/ThuMay1916170020164631.html.zip
	lib/Interface/html/output/ThuMay1916180320161205.html
	lib/Interface/html/output/ThuMay1916180320161205.html.zip
	lib/Interface/html/output/ThuMay1916185520161934.html
	lib/Interface/html/output/ThuMay1916185520161934.html.zip
	lib/Interface/html/output/ThuMay1916194620169756.html
	lib/Interface/html/output/ThuMay1916210120162607.html
	lib/Interface/html/output/ThuMay1916235020168210.html
	lib/Interface/html/output/ThuMay1916235020168210.html.zip
	lib/Interface/html/output/ThuMay1916262320161910.html
	lib/Interface/html/output/ThuMay1916274020161950.html
	lib/Interface/html/output/ThuMay1916274020161950.html.zip
	lib/Interface/html/output/ThuMay1916294920168425.html
	lib/Interface/html/output/ThuMay1916294920168425.html.zip
	lib/Interface/html/output/ThuMay1916304020163503.html
	lib/Interface/html/output/ThuMay1916304020163503.html.zip
	lib/Interface/html/output/ThuMay1916305220166327.html
	lib/Interface/html/output/ThuMay1916305220166327.html.zip
	lib/Interface/html/output/ThuMay1916310120167487.html
	lib/Interface/html/output/ThuMay1916310120167487.html.zip
	lib/Interface/html/output/ThuMay1916313520168149.html
	lib/Interface/html/output/ThuMay2609031820168191.html
	lib/Interface/html/output/ThuMay2609043620164349.html
	lib/Interface/html/output/ThuMay2609080220164259.html
	lib/Interface/html/output/ThuMay2609093820162607.html
	lib/Interface/html/output/ThuMay2609112520165265.html
	lib/Interface/html/output/ThuMay2609134920169361.html
	lib/Interface/html/output/ThuMay2609170920163081.html
	lib/Interface/html/output/ThuMay2609435320163018.html
	lib/Interface/html/output/ThuMay2609451720162779.html
	lib/Interface/html/output/ThuMay2609464420163104.html
	lib/Interface/html/output/ThuMay2609472820165624.html
	lib/Interface/html/output/ThuMay2609480020162530.html
	lib/Interface/html/output/ThuMay2609493420168077.html
	lib/Interface/html/output/ThuMay2609493420168077.html.zip
	lib/Interface/html/output/ThuMay2609522820164837.html
	lib/Interface/html/output/ThuMay2609522820164837.html.zip
	lib/Interface/html/output/ThuMay2609525320162695.html
	lib/Interface/html/output/ThuMay2609525320162695.html.zip
	lib/Interface/html/output/ThuMay2609555520165545.html
	lib/Interface/html/output/ThuMay2609555520165545.html.zip
	lib/Interface/html/output/ThuMay2609573820162271.html
	lib/Interface/html/output/ThuMay2609573820162271.html.zip
	lib/Interface/html/output/ThuMay2609582620167428.html
	lib/Interface/html/output/ThuMay2609582620167428.html.zip
	lib/Interface/html/output/ThuMay2609585720161022.html
	lib/Interface/html/output/ThuMay2609585720161022.html.zip
	lib/Interface/html/output/ThuMay2609594920162236.html
	lib/Interface/html/output/ThuMay2610003120162170.html
	lib/Interface/html/output/ThuMay2610020420163670.html
	lib/Interface/html/output/ThuMay2610070520161727.html
	lib/Interface/html/output/ThuMay2610083520168063.html
	lib/Interface/html/output/ThuMay2610090320166610.html
	lib/Interface/html/output/ThuMay2610090320166610.html.zip
	lib/Interface/html/output/ThuMay2610112120161878.html
	lib/Interface/html/output/ThuMay2610121320162619.html
	lib/Interface/html/output/ThuMay2610121320162619.html.zip
	lib/Interface/html/output/ThuMay2611012320163285.html
	lib/Interface/html/output/ThuMay2611012320163285.html.zip
	lib/Interface/html/output/ThuMay2611021020161438.html
	lib/Interface/html/output/ThuMay2611021020161438.html.zip
	lib/Interface/html/output/ThuMay2611031720166085.html
	lib/Interface/html/output/ThuMay2613424720169784.html
	lib/Interface/html/output/ThuMay2613424720169784.html.zip
	lib/Interface/html/output/ThuMay2613482120161527.html
	lib/Interface/html/output/ThuMay2613482120161527.html.zip
	lib/Interface/html/output/ThuMay2613490020166904.html
	lib/Interface/html/output/ThuMay2613490020166904.html.zip
	lib/Interface/html/output/ThuMay2613521020162027.html
	lib/Interface/html/output/ThuMay2613521020162027.html.zip
	lib/Interface/html/output/ThuMay2613552620163865.html
	lib/Interface/html/output/ThuMay2613552620163865.html.zip
	lib/Interface/html/output/TueFeb2815062720178415.html
	lib/Interface/html/output/TueFeb2815112220177455.html
	lib/Interface/html/output/TueFeb2815152420178022.html
	lib/Interface/html/output/TueFeb2815152420178022.html.zip
	lib/Interface/html/output/TueFeb2815351020178870.html
	lib/Interface/html/output/TueFeb2815374920174972.html
	lib/Interface/html/output/TueFeb2815385920172911.html
	lib/Interface/html/output/TueFeb2815402020177632.html
	lib/Interface/html/output/TueFeb2815404920176227.html
	lib/Interface/html/output/TueFeb2815434020176717.html
	lib/Interface/html/output/TueFeb2815434020176717.html.zip
	lib/Interface/html/output/TueFeb2815493820179580.html
	lib/Interface/html/output/TueFeb2815493820179580.html.zip
	lib/Interface/html/output/TueMay2414050720162420.html
	lib/Interface/html/output/TueMay2414050720162420.html.zip
	lib/Interface/html/output/WedFeb2208501520177043.html
	lib/Interface/html/output/WedFeb2208522720174969.html
	lib/Interface/html/output/WedFeb2209132120175101.html
	lib/Interface/html/output/WedFeb2209220420176049.html
	lib/Interface/html/output/WedFeb2209281120172709.html
	lib/Interface/html/output/WedFeb2211190220174524.html
	lib/Interface/html/output/WedFeb2211254720179829.html
	lib/Interface/html/output/WedFeb2211304320173311.html
	lib/Interface/html/output/WedFeb2211314820174619.html
	lib/Interface/html/output/WedFeb2211333020172440.html
	lib/Interface/html/output/WedFeb2211383120174320.html
	lib/Interface/html/output/WedFeb2211383120174320.html.zip
	lib/Interface/html/output/WedFeb2211483720175082.html
	lib/Interface/html/output/WedFeb2211483720175082.html.zip
	lib/Interface/html/output/WedMay2515145020162381.html
	lib/Interface/html/output/WedMay2516051820167448.html
	lib/Interface/html/output/WedMay2516051820167448.html.zip
	lib/Interface/html/output/WedMay2516065620161126.html
	lib/Interface/html/output/WedMay2516065620161126.html.zip
	lib/Interface/html/output/WedMay2516085420166409.html
	lib/Interface/html/output/WedMay2516103620166082.html
	lib/Interface/html/output/WedMay2516113820168636.html
	lib/Interface/html/output/WedMay2516212620163783.html
	lib/Interface/html/output/WedMay2516261120165773.html
	lib/Interface/html/output/WedMay2516270120161803.html
	lib/Interface/index.html
	lib/Interface/scripts/assembly_summary.txt
	lib/Interface/scripts/server.conf
	panseq.batch
	Panseq.iml
	t/plasmids/accessoryGenomeFragments.fasta
	t/plasmids/binary.phylip
	t/plasmids/binary_table.txt
	t/plasmids/cdhit.fasta.clstr
	t/plasmids/core_snps.txt
	t/plasmids/coreGenomeFragments.fasta
	t/plasmids/locus_alleles.fasta
	t/plasmids/Master.log
	t/plasmids/pan_genome.txt
	t/plasmids/panGenome.fasta
	t/plasmids/panseq_results.zip
	t/plasmids/phylip_name_conversion.txt
	t/plasmids/snp.phylip
	t/plasmids/snp_table.txt
	t/query.batch
	t/query/accessoryGenomeFragments.fasta
	t/query/binary.phylip
	t/query/binary_table.txt
	t/query/core_snps.txt
	t/query/coreGenomeFragments.fasta
	t/query/locus_alleles.fasta
	t/query/Master.log
	t/query/pan_genome.txt
	t/query/panseq_results.zip
	t/query/phylip_name_conversion.txt
	t/query/snp.phylip
	t/query/snp_table.txt
Please inform the author.

Created MYMETA.yml and MYMETA.json
Creating new 'Build' script for 'Panseq' version '3.1.1'

Problems when installing Test::Pretty dependency

Hi,
I'm trying to install Panseq and all its dependencies. All of them installed fine except the Test::Pretty dependency:

You have loaded versions of test modules known to have problems with Test2.

This could explain some test failures.

* Module 'Test::Pretty' is known to be broken in version 0.32 and below, newer versions have not been tested. You have: 0.32

...
Files=20, Tests=55, 3 wallclock secs ( 0.14 usr 0.03 sys + 2.50 cusr 0.32 csys = 2.99 CPU)
Result: FAIL
Failed 11/20 test programs. 24/55 subtests failed.
TOKUHIROM/Test-Pretty-0.32.tar.gz
./Build test -- NOT OK
//hint// to see the cpan-testers results for installing this module, try:
reports TOKUHIROM/Test-Pretty-0.32.tar.gz
Running Build install
make test had returned bad status, won't install without force

I know that the problem is not Panseq itself, but without this dependency I cannot move forward. Any suggestions?
Thanks

i can run the test but not the experiment with my genomes

Hi, I'm trying to run Panseq on my genomes, but all I get is warnings flooding stdout:

Overwrite set to true. Deleting directory ./synthetic_output/
2019/04/08 11:31:10 INFO | NovelIterator.pm:186> We have 4961 genomes this run
2019/04/08 11:31:11 WARN | CombineFilesIntoSingleFile.pm:83> Skipping ./synthetic_output/62624a59a83356893f0365cef7132da6_965b7fb4409c563321d33d48470db364_NR as it has size of 0
2019/04/08 11:31:11 WARN | CombineFilesIntoSingleFile.pm:83> Skipping ./synthetic_output/2f922fadbde7c421a86f1c8b15f57ef7_49afd7839dd1e47eda0d7b9e263f94c6_NR as it has size of 0
ERROR: Could not parse delta file, ./synthetic_output/724df4764c6cb27f57630bb7f4db03ff_c64063829facece23b9feafb1fb74417.delta
error no: 400
ERROR: Could not parse delta file, ./synthetic_output/c838799005c75d029fefb614c3c2b511_7e75d0552ec714b52526e79c0bc219ef.delta
error no: 400

In my working directory I get a nucmer.error file that says:
20190408|113121| 613| ERROR: The following critical files could not be used
20190408|113121| 613| /home/steve/Scrivania/MUMmer3.23/aux_bin/postnuc
20190408|113121| 613| /home/steve/Scrivania/MUMmer3.23/aux_bin/prenuc
20190408|113121| 613| /home/steve/Scrivania/MUMmer3.23/mgaps
20190408|113121| 613| /home/steve/Scrivania/MUMmer3.23/mummer
20190408|113121| 613| Check your paths and file permissions and try again

Permissions are enabled on all of those files:
drwxrwxrwx 2 steve steve 4096 apr 5 15:14 .
drwxrwxrwx 6 steve steve 4096 apr 5 15:15 ..
-rwxr-xr-x 1 steve steve 76448 apr 5 15:14 postnuc
-rwxr-xr-x 1 steve steve 85152 apr 5 15:14 postpro
-rwxr-xr-x 1 steve steve 18320 apr 5 15:14 prenuc
-rwxr-xr-x 1 steve steve 26800 apr 5 15:14 prepro
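
The listing above appears to be of the aux_bin folder only, while the error also names mgaps and mummer one level up. The paths in nucmer.error are absolute paths under /home/steve/Scrivania/MUMmer3.23/, whereas the settings file in this post gives ./MUMmer3.23/. A minimal sketch for checking the exact paths from the error message (the helper name is hypothetical):

```python
import os
import sys


def unusable(paths):
    """Return the paths that are missing or not executable by the current user."""
    return [p for p in paths
            if not (os.path.isfile(p) and os.access(p, os.X_OK))]


# Paths copied from the nucmer.error message above.
reported = [
    "/home/steve/Scrivania/MUMmer3.23/aux_bin/postnuc",
    "/home/steve/Scrivania/MUMmer3.23/aux_bin/prenuc",
    "/home/steve/Scrivania/MUMmer3.23/mgaps",
    "/home/steve/Scrivania/MUMmer3.23/mummer",
]
for path in unusable(reported):
    print("missing or not executable:", path)

# Sanity check of the helper against a file known to be executable.
print(unusable([sys.executable]))
```

Any path printed by the loop no longer exists at that location or cannot be executed by the current user.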

my settings file is as follows:
queryDirectory ./synthetic_genomes
baseDirectory ./synthetic_output
numberOfCores 6
mummerDirectory ./MUMmer3.23/
blastDirectory ./blast2.9/
minimumNovelRegionSize 500
novelRegionFinderMode no_duplicates
muscleExecutable ./muscle3.8.31/muscle3.8.31
fragmentationSize 500
percentIdentityCutoff 85
coreGenomeThreshold 2
runMode pan
cdhitDirectory ./cd-hit-v4.8.1
overwrite 1

I'm using 10 genomes (SE001 to SE010); each genome is a collection of genes in FASTA format, i.e.

G1_SE001
....
G2_SE001
....

I use BLAST 2.9.0+, MUMmer 3.23, MUSCLE 3.8.31, and cd-hit v4.8.1. All of them are folders inside the Panseq-master folder (I simply dragged them in), so that I can run everything from inside the Panseq-master folder.

I used the online version of Panseq ( https://lfz.corefacility.ca/panseq/page/pan.html ) and got a good result, so I assume my input files are acceptable.

thank you
