amuralle / pygr Goto Github PK

View Code? Open in Web Editor NEW

0.0 0.0 0.0 586 KB

Automatically exported from code.google.com/p/pygr

Python 100.00%

pygr's People

Contributors

pygr's Issues

add schema grouping to pygr.Data

A completely new feature, for organizing, selecting and managing schema
information.  Pretty simple to implement, I think.  I will add a wiki page
on this topic.

Original issue reported on code.google.com by [email protected] on 11 Sep 2008 at 4:39

Write a release procedure

as Titus suggested, we should write a SOP for producing the final
package(s) of any release.

Original issue reported on code.google.com by [email protected] on 11 Sep 2008 at 3:49

Installation/Usage Problems with the Sourceforge versions 0.5 & 0.7b3

I remember Chris mentioning that I should document these errors even though
we aren't using these versions anymore.

Tested Operating System: Ubuntu Linux, 32 bit

Pre-installed software: python-pyrex, python-dev, g++, gcc,  make, cmake

Documented failures: installing, testing with command:
'python run-doctests.py annotating-yeast.txt'

Notes:
I tested 0.7b3 and 0.5 inside a Vmware Virtual Machine. I reverted the VM
before installing either version to simulate a clean install. In the case
of 0.7b3, I installed motility with no problems and was able to run the
annotating-yeast.txt doctest, though it failed 30 of the tests. 0.5 never
installed correctly so I wasn't able to test it.

Original issue reported on code.google.com by [email protected] on 27 May 2008 at 6:31

Attachments:

KeyError for BLAST results that return no hits

What steps will reproduce the problem?
g = pygr.Data.getResource("Bio.Seq.Genome.YEAST.sacCer")
s = Sequence("TCTTCCTCACTCTCAGGGT", "test")
r = genome.blast(s, maxseq=1)
r[s].edges()

What is the expected output? What do you see instead?
I guess the expected output would be an empty list....

I get:
---------------------------------------------------------------------------
<type 'exceptions.KeyError'>              Traceback (most recent call last)

/home/baldig/projects/genomics/svn/data/yeast/2008_9_16_12_23_28_1_260_1/yeast/l
ocations/<ipython
console> in <module>()

/home/baldig/projects/genomics/svn/data/yeast/2008_9_16_12_23_28_1_260_1/yeast/l
ocations/pygr.cnestedlist.pyx
in pygr.cnestedlist.NLMSA.__getitem__()

/home/dock/shared_libraries/lx64/pkgs/python/2.5.1/lib/python2.5/site-packages/p
ygr/nlmsa_utils.py
in __getitem__(self, seq)

/home/dock/shared_libraries/lx64/pkgs/python/2.5.1/lib/python2.5/site-packages/p
ygr/nlmsa_utils.py
in getSeqID(self, seq)

/home/dock/shared_libraries/lx64/pythonpkgs/2.5.1/pygr_0_7_1/pygr/seqdb.py
in __getitem__(self, seq)
   1456         'handles optional mode that adds seq if not already present'
   1457         try:
-> 1458             return self.getName(seq)
   1459         except KeyError:
   1460             if self.db.addAll:

/home/dock/shared_libraries/lx64/pythonpkgs/2.5.1/pygr_0_7_1/pygr/seqdb.py
in getName(self, seq)
   1450         except AttributeError: # NO db?  THEN TREAT AS A user SEQUENCE
   1451             userID='user'+self.db.separator+seq.pathForward.id
-> 1452             s=self.db[userID] # MAKE SURE ALREADY IN user SEQ
DICTIONARY
   1453             return userID # ALREADY THERE
   1454

/home/dock/shared_libraries/lx64/pythonpkgs/2.5.1/pygr_0_7_1/pygr/seqdb.py
in __getitem__(self, k)
   1356             prefix = t[0] # ASSUME PREFIX DOESN'T CONTAIN separator
   1357             id = k[len(prefix)+1:] # SKIP PAST PREFIX
-> 1358         d=self.prefixDict[prefix]
   1359         try: # TRY TO USE int KEY FIRST
   1360             return d[int(id)]

<type 'exceptions.KeyError'>: 'user'



What version of the product are you using? On what operating system?
0.7.1

Please provide any additional information below.
The output of the blastall command from the command line is below. The
g.blast returns no results because the e-value is above the threshold, but
the KeyError of "user" is strange and not meaningful. An empty list
returned would be more useful for not having results.

$ cat test
>test
TCTTCCTCACTCTCAGGGT
$ blastall -i test -p blastn -d ~baldig/projects/genomics/genomes/sacCer.fa
-v 1 -b 1
BLASTN 2.2.18 [Mar-02-2008]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= test
         (19 letters)

Database: /home/baldig/projects/genomics/genomes/sacCer.fa
           17 sequences; 12,156,677 total letters

Searching..................................................done



                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

chr7                                                                   26 
 0.91

>chr7
          Length = 1090947

 Score = 26.3 bits (13), Expect = 0.91
 Identities = 13/13 (100%)
 Strand = Plus / Minus


Query: 2      cttcctcactctc 14
              |||||||||||||
Sbjct: 102012 cttcctcactctc 102000


  Database: /home/baldig/projects/genomics/genomes/sacCer.fa
    Posted date:  Sep 12, 2008  3:19 PM
  Number of letters in database: 12,156,677
  Number of sequences in database:  17

Lambda     K      H
    1.37    0.711     1.31

Gapped
Lambda     K      H
    1.37    0.711     1.31


Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Sequences: 17
Number of Hits to DB: 1171
Number of extensions: 50
Number of successful extensions: 13
Number of sequences better than 10.0: 7
Number of HSP's gapped: 13
Number of HSP's successfully gapped: 13
Length of query: 19
Length of database: 12,156,677
Length adjustment: 13
Effective length of query: 6
Effective length of database: 12,156,456
Effective search space: 72938736
Effective search space used: 72938736
X1: 11 (21.8 bits)
X2: 15 (29.7 bits)
X3: 50 (99.1 bits)
S1: 12 (24.3 bits)
S2: 12 (24.3 bits)

Original issue reported on code.google.com by [email protected] on 25 Sep 2008 at 8:46

Add full readonly dict interface to BlastDBbase and PrefixUnionDict

Uses DictMixin and a few judicious function definitions to correct the dict
interface for BlastDBbase and add a full dict interface to
PrefixUnionDict.  __setitem__-based functions are added to SeqDBbase to
raise NotImplementedErrors instead of silently doing the default dict stuff.

Original issue reported on code.google.com by [email protected] on 17 Aug 2008 at 5:30

Attachments:

0001-fixed-dict-interfaces-for-BlastDB-PrefixUnionDict-a.patch

run XMLRPC server in a separate thread by default

run server in a separate thread so that administrator keeps an active
Python prompt, and can continue to manage the server through that Python
interface, e.g. run Python via screen so you can reconnect at any time to
the server process and manage it via new Python commands.

Original issue reported on code.google.com by [email protected] on 11 Sep 2008 at 4:26

define informative repr for BlastDBbase, BlastDB, and PrefixUnionDict; better KeyError report

replaces KeyError report:

  File "/Users/t/dev/pygr/pygr/seqdb.py", line 777, in get_real_id
    raise KeyError # FOUND NO MAPPING, SO RAISE EXCEPTION
KeyError


with more informative:

KeyError: "no key 'x' in database <BlastDB 'dnaseq'>"

and

KeyError: "no key 'db2.x' in <pygr.seqdb.PrefixUnionDict object at 0xd95b90>"

Original issue reported on code.google.com by [email protected] on 16 Aug 2008 at 10:41

Attachments:

0001-defined-informative-repr-for-BlastDB-and-PrefixU.patch

Add pygr build files and .pygr_data to .gitignore

Added more temporary/generated files to .gitignore.

Original issue reported on code.google.com by [email protected] on 21 Jun 2008 at 8:53

Attachments:

0002-added-pygr-build-files-and-.pygr_data-to-root-.gitig.patch

patch: correctly identify user's MySQL config file on Windows

This patches sqlgraph.py to find my.ini or my.cnf under WINDIR and SYSTEMDRIVE.

Original issue reported on code.google.com by [email protected] on 21 Jun 2008 at 8:52

Attachments:

0001-nolleyal-s-patch-to-correctly-identify-the-MySQL-use.patch

minor refactoring of protest.py

 2008 12:54:22 -0700
Subject: [PATCH] nolleyal and titus's minor refactoring of protest.py.

Briefly,

 - switched the os.system call to subprocess, which removes the need to
   use temporary files to store stdout;

 - added stderr capture & replay;

 - added #! /usr/bin/env python at the top to support 'protest.py's
   direct use from the shell;

 - added a docstring at the top of the file;

 - removed space from '.' output to conform to nose's output;

 - cleaned up the main test runner and a few other places for PEP-8.

NOTE that subprocess doesn't exist in Python 2.3, so this axes Python
2.3 compatibility for the tests.  I don't know how important this is, as
every major platform now comes with 2.4 or above; see mailist list
discussion.

Original issue reported on code.google.com by [email protected] on 21 Jun 2008 at 8:55

Attachments:

0004-nolleyal-and-titus-s-minor-refactoring-of-protest.py.patch

Add explicit equality test for independently generated annotation objects.

...

Original issue reported on code.google.com by [email protected] on 1 Sep 2008 at 6:54

Attachments:

0003-added-explicit-equality-test-for-independently-gener.patch

incomplete dict method support for various pygr dict-like classes

Pygr emphasizes using dict-like interfaces to database tables.  These
interfaces often subclass the built-in dict class, and "cache" objects
retrieved from the database using the built-in dict methods.  However, this
means that many of the standard dict methods actually only reflect data
that is in the cache, rather than the complete set of data in the database.
 This greatly diminishes the value of the dict-like interface!

For example, using seqdb.BlastDB as retrieved from pygr.Data:
What steps will reproduce the problem?
>>> import pygr.Data
>>> hg17 = pygr.Data.Bio.Seq.Genome.HUMAN.hg17()
>>> hg17.keys()
[]
>>> 'chr1' in hg17
False

This issue was raised in
http://groups.google.com/group/pygr-dev/t/80736ba41bc79739?hl=en

Different database classes in pygr have different levels of dict method
support.  All of them correctly support __getitem__() (and __setitem__() if
the interface is supposed to be writable), and __iter__().  Many of them
correctly support the additional iterator methods (i.e. keys(), items(),
iteritems(), values(), itervalues()).  __len__() should also be correctly
supported in most cases.  Less common operations like copy(), clear(),
update(), get(), setdefault(), pop() are not implemented.

We should do a survey of all dict-like classes in pygr and identify gaps in
the Mapping Protocol support.  Obvious points:

__contains__(), and __len__() must reflect the database, not the cache

__setitem__() and __delitem__(), update(), clear(), setdefault(), pop()
should raise exceptions if the database is not writable, rather than just
silently affecting the cache.

We should provide a standard method for clearing the cache, e.g.
clear_cache(), since clear() no longer would be available for that purpose.
 Control over actual memory usage is very important for working with large
datasets.

Note that caching guarantees an important property for Object-Relational
Mapping, namely that different requests for the same key are guaranteed to
return the same object.

Note that some of the Mapping Protocol methods imply instantiating all
items in the database: e.g. items(), values(), copy().  Pygr tries to
follow this logic, to give users a reasonably intuitive level of control
over whether data will be retrieved from the database on a row-by-row basis
vs. loading all rows via a single query.  Specifically, methods like
__iter__() and keys() do not themselves force loading of all rows from the
database, whereas methods like items() do.  The logic here is that by
calling items(), the user is declaring an intent to examine every single
row in the database, so this should be done with a single query to maximize
performance.

Original issue reported on code.google.com by [email protected] on 26 May 2008 at 9:39

Disallow overwriting seqdb.BlastDB in pygr.Data or Print out error message on overwriting attempt

Sometimes, when you are using pygr, you may encounter error message 
saying "seq not in prefixuniondict" even if you think you are using same 
Annotation DB or NLMSA. Overwriting seqdb.BlastDB in pygr.Data can sever 
the connection between seqdb and AnnotDB/NLMSA. I think we need to 
disallow overwriting seqdb.BlastDB in pygr.Data or print out error message 
saying it is not good...

Original issue reported on code.google.com by [email protected] on 31 May 2008 at 7:37

test_loader fails

What steps will reproduce the problem?
1. Run test_loader.py from git repository
2. testGraphEdges fails

Apparently, edge_result has three values in each item in the list, the last
one being None, whereas the expected result has only two.

I've included a patch in which the expected result matches edge_result. I
have also changed the import so it will import from pygr, instead of
relying on the pygr environment variable. Feel free to ignore this.

Original issue reported on code.google.com by [email protected] on 11 May 2008 at 3:44

Attachments:

test_loader.py.diff

BlastDB has gotten slow due to cache

What steps will reproduce the problem?
1. Create FASTA file with 50 million sequences in it

outfile = open('R1', 'w')
for icount in range(1, 50000000):
    outfile.write('>' + str(icount) + '\n')
    outfile.write('ACGT\n')
outfile.close()

2. Open that FASTA (requires BlastDB building too)

from pygr import seqdb
R1 = seqdb.BlastDB('R1')

What is the expected output? What do you see instead?

Openning R1 should be fast without preloading sequence IDs. But, currently 
BlastDB loads every sequence IDs into memory and it takes several minutes 
to just open BlastDB. And that affects performance of NLMSA.

Please use labels and text to provide additional information.

1. Version as of August 13.

>>> seqdb.BlastDB('R1')
{}

Less than 1 sec. Returns empty dict.

2. Version as of Today.

>>> seqdb.BlastDB('R1')
<BlastDBbase 'R1'>

Took several minutes and load all indice into memory.

Original issue reported on code.google.com by [email protected] on 14 Oct 2008 at 11:22

Fix bug in sqlgraph.SQLTable.generic_iterator; add associated tests

Calling SQLTable.iteritems() causes an error because the default 'cache_f'
function is incorrect.

'cache_f' defaults to the unbound class method cache_items(); it needs to
either be set to the bound method self.cache_items() (my solution) OR
'self' must be passed into cache_items() explicitly in the generic_iterator
function.

The patch also adds a new file, tests/sqltable_tests.py, that tests basic
dict reading behavior.  DictMixin is used to provide 'get' functionality
among others.  dict writing and deletion is not tested.

Original issue reported on code.google.com by [email protected] on 3 Sep 2008 at 2:03

Attachments:

Basic-SQLTable-tests-and-resulting-bug-fix-for-gene.patch

KeyError on deleting pygr.Data resource

What steps will reproduce the problem?
1. Save a resource to pygr.Data with no SCHEMA
2. Delete the resource (i.e. via pygr.Data.deleteResource )


What is the expected output? What do you see instead?
Expected output is nothing.  We see a KeyError instead:

Traceback (most recent call last):
  File "/Users/Robby/pygr-dev/bin/python", line 23, in <module>
    execfile(sys.argv[0])
  File "/Users/Robby/Desktop/delete-resource-schema-error.py", line 32, in
<module>
    remRes()
  File "/Users/Robby/Desktop/delete-resource-schema-error.py", line 26, in
remRes
    pygr.Data.deleteResource(sqltab)
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/p
ygr-0.7-py2.5-macosx-10.3-fat.egg/pygr/Data.py",
line 864, in deleteResource
    self.delSchema(id,layer)
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/p
ygr-0.7-py2.5-macosx-10.3-fat.egg/pygr/Data.py",
line 1019, in delSchema
    d=db.getschema(id) # GET THE EXISTING SCHEMA
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/p
ygr-0.7-py2.5-macosx-10.3-fat.egg/pygr/Data.py",
line 536, in getschema
    return self.db['SCHEMA.'+id]
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/shelve.py", 
line
112, in __getitem__
    f = StringIO(self.dict[key])
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/bsddb/__init__.
py",
line 223, in __getitem__
    return _DeadlockWrap(lambda: self.db[key])  # self.db[key]
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/bsddb/dbutils.p
y",
line 62, in DeadlockWrap
    return function(*_args, **_kwargs)
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/bsddb/__init__.
py",
line 223, in <lambda>
    return _DeadlockWrap(lambda: self.db[key])  # self.db[key]
KeyError: 'SCHEMA.Bio.Seq.Genome.ECOLI'


What version of the product are you using? On what operating system?
Using MacOS v10.4 or v10.5
git commit # c42925322ab8258b40bd09fd9bf0c871c6ebf75a


Please provide any additional information below.
The pygr.Data resource does in fact get deleted but a call is made to
delSchema() with a pygr.Data resource id of 'SCHEMA.name.of.resource'.  If
no schema was ever saved, this resource will never exist and thus retrieval
via "d=db.getschema(id) # GET THE EXISTING SCHEMA" will always produce the
KeyError.

See attached script and modified ecoli 'genome' file to reproduce.

Original issue reported on code.google.com by [email protected] on 17 Oct 2008 at 11:22

Attachments:

Standardize ensembl components

Make SQLTable, SQLGraph and kin provide the ORDER BY etc. capabilities that
forced Jenny to write custom code for the Ensembl interface.

Original issue reported on code.google.com by [email protected] on 11 Sep 2008 at 3:58

patch to test seq dicts passed into AnnotationDB.

This is a proposed solution to the problem where users pass in the wrong
sequence dictionary to AnnotationDB and get an error only when they ask for
an annotation object.  Because objects are created on -demand, the error
they get shows up after the construction of the AnnotationDB; the idea in
this patch is to test the first value in the AnnotationDB and raise an
error if it fails.

(This is a common user error in my experience, and pygr's error reporting
is confusing.)

Original issue reported on code.google.com by [email protected] on 1 Sep 2008 at 6:48

Attachments:

0001-proposed-addition-to-AnnotationDB-to-test-seq-dicts.patch

Initial delay for membership checking in seqdb.BlastDB

I found another slowdown in seqdb.BlastDB. For previous discussion, you can
see postings on http://groups.google.co.kr/group/pygr-dev/t/48c4a3b6d0fec0e6

I made a seqdb.BlastDB, and it has 30 million sequences.

>>> from pygr import seqdb
>>> R1 = seqdb.BlastDB('R1')
>>> R1.has_key('1') # IF I CHECK THE CORRECT SEQUENCE ID, IT IS FAST
True
>>> R1.has_key('3126554') # IF I CHECK THE CORRECT SEQUENCE ID, IT IS FAST
True
>>> R1.has_key('A') # IF I CHECK THE NON-EXISTING SEQUENCE ID, IT TOOK
ABOUT 5 MINUTES TO RETURN RESULTS
False
>>> R1.has_key('B') # BUT IF I CHECK THE NON-EXISTING SEQUENCE ID AGAIN, IT
IS FAST
False
>>> R1.has_key('C') # BUT IF I CHECK THE NON-EXISTING SEQUENCE ID AGAIN, IT
IS FAST
False

I don't know what is happening here, but it means we have to wait a few
minutes if we enter wrong sequence IDs onto seqdb.BlastDB.

Original issue reported on code.google.com by [email protected] on 18 Nov 2008 at 10:42

megatest failure: cannot delete from dict

See groups discussion:

http://groups.google.com/group/pygr-dev/browse_thread/thread/123a4a6b92d7abe8

Briefly, this is a pyrex bug related to an optimization for C integers when
indexing maps; the two proposed workarounds are

a) use d.__delitem__(intkey) instead of del d[intkey]

b) change the intkey to no longer be a C integer

I'm agnostic on which one of these to go with.

--titus

Original issue reported on code.google.com by [email protected] on 25 Oct 2008 at 1:49

Update docs for 0.8 refactoring

Lots of refactoring needs to be updated in the docs, e.g.
 * SequenceDB and kin
 * SQLTable and kin
 * subclass binding
 * AnnotationDB and kin
 * autoGC option
etc.

Original issue reported on code.google.com by [email protected] on 11 Sep 2008 at 4:19

crashing and compilation errors on Windows

The Windows binary distribution of pygr crashes for me when building 

cnestedlist.NLMSA('target', mode='w', use_virtual_lpo=True)

(the in memory build works!). 

When trying to compile the sources on windows the compile fails by not
finding the fseeko symbol (this is a Linux specific function). By adding a
conditional compilation to this macro I managed to fix this error (see
patch at the end) and the example above now works as expected. 

best.

Original issue reported on code.google.com by [email protected] on 12 May 2008 at 3:23

Attachments:

patch

Refactor BLAST support as a mapping

Currently this is provided via two methods on BlastDB: blast() and megablast().

In the spirit of generalizing this to follow the "Pygr pattern", this
should instead be a mapping, specifically a Pygr graph interface that takes
sequences as source nodes, returns homologous sequences as destination nodes.

BlastDB can keep its blast() and megablast() methods, for purposes of
backwards compatibility.

Original issue reported on code.google.com by [email protected] on 11 Sep 2008 at 4:10

tblastn and blastx support

Right now pygr is restricted to 1:1 alignment relations, which works fine
for blastn and blastp, but not tblastn (protein query vs. nucleotide
database translated to protein sequence) or blastx (nucleotide query vs.
protein database).


tblastn and blastx are problematic for several reasons:
- the returned alignment is not of the actual query sequence and database
sequences, but instead of a *translation* (possibly after
reverse-complementing!) of one side or the other.  Thus the alignment
results are NOT in the coordinate system of the query and the database
seqs; instead they involve a new coordinate system (a translation) created
on the fly.

- this involve a 3:1 alignment relation between nucleotide vs. protein
sequence.  This is problematic in all sorts of ways, the most fundamental
of which is how to robustly represent the reading frame "phase" for any
given part of the alignment (i.e. the ability to represent alignment to a
"partial codon", which can easily occur when aligning protein against exons
which may split a single codon across an exon-exon junction.

- I think tblastn/blastx imply the need a separate coordinate system for
this nucleotide vs. protein alignment problem.  For example, what if the
query is a nucleotide sequence and finds a reverse-complement homology to a
protein sequence?  I.e. when the query is reverse-complemented, it has a
translated-homology to the protein sequence.  The result of any alignment
query must always be returned in the same orientation as the user-supplied
query, which means that the homologous protein interval must be returned in
"negative orientation" -- which of course does not exist for a true protein
sequence.

POSSIBLE SOLUTIONS:

I think this would be easy to resolve by using an annotation to represent
the open reading frame on the protein sequence. The key idea is that an
annotation is an independent coordinate system, but can be converted to the
corresponding sequence interval by requesting its sequence attribute.  So
we could have tblastn return 1:1 alignments of nucleotide sequence to an
ORF annotation (whose coordinate system would be expressed in bp, not aa).
 The user would request its sequence attribute to obtain the corresponding
protein sequence interval.  This would work well in both directions (i.e.
tblastn, and blastx).

The ORF annotation idea solves the "intermediate coordinate system" problem
nicely: it is a nucleotide coordinate system (which can correctly represent
either orientation).  But it is bound to the protein sequence that it
represents, and you can always convert a slice of an ORF annotation to the
corresponding slice of protein sequence by simply accessing its "sequence"
attribute.  We could even map such ORF annotations directly onto genomic
sequence.

Original issue reported on code.google.com by [email protected] on 27 Sep 2008 at 1:16

Patch for NLMSASlice cache hint memory leak; associated tests; refactoring of cache code.

cnestedlist.NLMSASlice has a memory leak (under Pyrex 0.9.8.4) where
__dealloc__ cannot properly delete cache objects from seqdb.cacheProxyDict
stored under self.deallocID, because __dealloc__ cannot call Python methods.

The three attached patches demonstrate this problem, fix the problem, and
then refactor the cacheHint code to eliminate now-unnecessary logic.

Original issue reported on code.google.com by [email protected] on 1 Sep 2008 at 6:11

Attachments:

support for gzipped file: NLMSABuilder, textfile_to_binaries, and dump_textfile

Text dump files from pre-built NLMSAs (available at 
http://biodb.bioinformagtics.ucla.edu/PYGRDATA) are extremely large. In 
total, if we uncompress them, it would be about 900GB.

I propose gzipped file support for three classes, NLMSABuilder, 
textfile_to_binaries, and dump_textfile. In that way, we can save disk 
space. Maybe we could give one option, compressed = True (or False) to 
work with gzipped archives.

Original issue reported on code.google.com by [email protected] on 1 Jun 2008 at 12:14

downloader.py, tarball has unnecessary path characters in uncompressed files

In tar.gz file, I usually uncompressed the files in directory and moved 
all files to parent directory. And then made one file for pygr.seqdb.

If we want to read that files in python zlib module, there is a problem. 
Because of the path in tar, first line of the extracted faasta file has 
binary characters for path! Thus, pygr does not recognize those files as 
FASTA. Python cannot read following files from downloader.py.

Is there any tar python library? Otherwise, the only solution may be 
extracting files by command line, which may arise another platform 
indepdence issue.


==> apiMel3 <==
Group1.fa0000664000462000024300014446262010701502220012075 0ustar  
angieprotein>Group1

==> caeRem2 <==
chrUn.fa0000664000431100024300116610013110615666653012031 0ustar  
hiramprotein>chrUn

==> canFam2 <==
1/chr1.fa0000664000462000024300075061311410344202301013154 0ustar  
angieprotein00000000000000>chr1

==> cb3 <==
chrI.fa0000664000431100024300005367547210607775707011664 0ustar  
hiramprotein>chrI

==> ce4 <==
chrI.fa0000664000431100024300007251306210605303060011620 0ustar  
hiramprotein>chrI

==> danRer3 <==
10/chr10.fa0000664000552100024300022532764310254136063013701 0ustar  
harteraprotein00000000000000>chr10

==> danRer4 <==
1/chr1.fa0000664000552100024300042252424310422467370012300 0ustar  
harteraprotein>chr1

==> dm3 <==
chr2L.fa0000664000462000024300013142324610636536751011716 0ustar  
angieprotein>chr2L

==> droSim1 <==
chr2L.fa0000664000462000024300012557376010226542767013167 0ustar  
angieprotein00000000000000>chr2L

==> droYak2 <==
4/chr4.fa0000664000462000024300000526216210336451111013164 0ustar  
angieprotein00000000000000>chr4

==> equCab1 <==
chr1.fa0000664000460600024300127305357310565133445012012 0ustar  
fanhsuprotein>chr1

==> fr2 <==
chrM.fa0000664000431100024300000004061610564660635011635 0ustar  
hiramprotein>chrM

==> galGal3 <==
chr1.fa0000664000462000024300141604161610464745460011603 0ustar  
angieprotein>chr1

==> gasAcu1 <==
chrI.fa0000664000462000024300015552750710470675544013104 0ustar  
angieprotein00000000000000>chrI

==> mm6 <==
1/chr1.fa0100664000431100024300136712674310215371015011751 0ustar  
hiramprotein>chr1

==> mm7 <==
1/chr1.fa0100664000431100024300136634417410304660023011747 0ustar  
hiramprotein>chr1

==> mm8 <==
1/chr1.fa0000664000431100024300137663025010375140642011750 0ustar  
hiramprotein>chr1

==> mm9 <==
chr1.fa0000664000431100024300137722222310651703673011614 0ustar  
hiramprotein>chr1

==> monDom4 <==
1/chr1.fa0000664000431100024300553653211710374717613011764 0ustar  
hiramprotein>chr1

==> oryLat1 <==
chr1.fa0000664000431100024300023342162410564147371011610 0ustar  
hiramprotein>chr1

==> panTro2 <==
1/chr1.fa0000664000522300024300157665055710404735056011617 0ustar  
kateprotein>chr1

==> ponAbe2 <==
chr1.fa0000664000462000024300157654750010700310073011572 0ustar  
angieprotein>chr1

==> priPac1 <==
chrUn.fa0000664000431100024300125026220510615674630012030 0ustar  
hiramprotein>chrUn

==> rheMac2 <==
softMask/chr1.fa0000664000552100024300157010116210443356564013730 0ustar  
harteraprotein>chr1

==> rn4 <==
1/chr1.fa0000664000462000024300202234056610406567020011731 0ustar  
angieprotein>chr1

==> strPur1 <==
urchin.hardMasked.fa0100664000441500024301026415226310231150023014111 
0ustar  aampprotein>Scaffold99932

apiMel3/chromFa.tar.gz
caePb1/chromFa.tar.gz
caeRem2/chromFa.tar.gz
canFam2/chromFa.tar.gz
cb3/chromFa.tar.gz
ce4/chromFa.tar.gz
danRer3/chromFa.tar.gz
danRer4/chromFa.tar.gz
dm3/chromFa.tar.gz
droSim1/chromFa.tar.gz
droYak2/chromFa.tar.gz
equCab1/chromFa.tar.gz
fr2/chromFa.tar.gz
galGal3/chromFa.tar.gz
gasAcu1/chromFa.tar.gz
mm6/chromFa.tar.gz
mm7/chromFa.tar.gz
mm8/chromFa.tar.gz
mm9/chromFa.tar.gz
monDom4/chromFa.tar.gz
oryLat1/chromFa.tar.gz
panTro2/chromFa.tar.gz
ponAbe2/chromFa.tar.gz
priPac1/chromFa.tar.gz
rheMac2/chromFa.tar.gz
rn4/chromFa.tar.gz
strPur1/allFa.tar.gz

Original issue reported on code.google.com by [email protected] on 14 May 2008 at 12:03

add NLMSA convenience method for loading aligned intervals from any source

pass it an iterator that generates (ival1, ival2) pairs, and they will be
automatically loaded into your NLMSA.  This is just a convenience (it is
not hard to save aligned intervals into NLMSA by the existing graph
interface), but may make it easier for people to add other formats... 
Maybe we could add a CLUSTAL reader as an example...

Original issue reported on code.google.com by [email protected] on 11 Sep 2008 at 5:02

pygr.Data.dir search improvement

Right now, pygr.Data.dir returns matches on first part of the given 
argument. It will be more convenient if we could add more search engine on 
pygr.Data resources.

1. pygr.Data.dir() : allow empty query as full search like pygr.Data.dir
('')
2. partial match search by string member check :
>>> 'ab' in 'abc'
True
Thus we can do like this,
pygr.Data.dir('HUMAN') => returns all resources if resource name 
has 'HUMAN' string in it.
3. regular expression. Above 2 can be improved more if we add re module 
search on pygr.Data.dir, say pygr.Data.regexp('*.hg17')

Original issue reported on code.google.com by [email protected] on 3 Jun 2008 at 9:08

Add read-only dict behavior and associated tests to AnnotationDB.

...

Original issue reported on code.google.com by [email protected] on 1 Sep 2008 at 6:53

Attachments:

0002-added-readonly-tests-to-AnnotationDB.patch

add SSL XMLRPC client and server support

This is needed for managing pygr.Data XMLRPC servers remotely in a secure
fashion.

Original issue reported on code.google.com by [email protected] on 11 Sep 2008 at 4:21

decorator syntax incompatible with Python 2.3

What steps will reproduce the problem?
1.from pygr import seqdb
2.
3.

What is the expected output? What do you see instead?

File "/Users/jenny/Desktop/pyensembl-0.1.0/pygr/seqdb.py", line 12
   @classmethod
   ^
SyntaxError: invalid syntax


What version of the product are you using? On what operating system?
using the pygr pulled from the git repository on October 11th, 2008

Mac OS X (10.4)

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 15 Oct 2008 at 5:14

Unnecessary non-PEP-8 spacing in setup.py

Too many inter-line spaces in setup.py!

Original issue reported on code.google.com by [email protected] on 22 Aug 2008 at 5:00

add pygrtest_common import to all primary tests

nolleyal's patch to add pygrtest_common to all primary tests.

pygrtest_common is an import common to the five primary test files
(blast_test, graph_test, nlmsa_test, pygrdata_test, and sequence_test).
It allows stock to be run before (or after) each test.

This is in support of two goals: first, adding code coverage analysis to
the tests; and second, adding an import path requirement.

The first goal is to analyze the code coverage of the current tests over
the pygr code base, in the interests of identifying areas that aren't
executed by the current tests.  Because 'protest' runs each test in its
own process (necessary for some tests, e.g. the pygr.Data tests) we
cannot run figleaf just on protest.py; it needs to be run for each of
the test cases.  We decided the best way to do this was to turn it on in
a common file which is imported by each test module.

Note that pygrtest_common only turns on figleaf code coverage recording
in test modules if it was on in protest.py; this is communicated by
using an environment variable.

The second goal of having a common import was to add an import path
mod & check to make sure that the tests are importing the development
version of pygr.  This removes the need to set PYTHONPATH and also
overcomes problems with some versions of easy_install, where the
installed egg is imported despite PYTHONPATH settings.  It also makes
it possible to "just run" the tests and be fairly confident that the
development tree is the version being tested.

The patch correctly looks for pygr in both the pygr/ directory and the
build/lib.$PLATFORM directory, so you can use either

   python setup.py build_ext -i

or

   python setup.py build

to compile pygr before running the tests.

Original issue reported on code.google.com by [email protected] on 21 Jun 2008 at 8:57

Attachments:

0005-nolleyal-s-patch-to-add-pygrtest_common-to-all-prima.patch

add a test for SQLSequence class method problem

we don't seem to have any tests for SQLSequence, which is pretty pathetic.
 We need to add at least some basic sequence retrieval and manipulation
tests...

Original issue reported on code.google.com by [email protected] on 27 Sep 2008 at 12:34

Analyze code coverage of tests, including megatests

Run the tests using the figleaf code coverage analysis tool and determine
which parts of the pygr codebase are not executed by the tests.

Original issue reported on code.google.com by [email protected] on 27 May 2008 at 4:35

Test under Python 2.3-2.6

Make sure that the latest changes to protest.py work under Python 2.2 and
2.3 and run all the other tests on those versions, too.  The use of
subprocess may have to be changed, in particular.

Original issue reported on code.google.com by [email protected] on 22 Aug 2008 at 5:02

Move old tests into tests/deprecated

tests/ contains a bunch of stuff that Chris Lee says isn't used any more --
now that we have cast off the shackles of CVS, let's move that into
tests/deprecated/ or something.

Original issue reported on code.google.com by [email protected] on 27 May 2008 at 4:41

add status reporting of skip/errors while running unit tests with protest

with this patch, protest will now output 'E' (error) and 'S' (skip)
appropriately while running the tests.  As per standard nose/unittest
behavior, requested by Istvan Albert.

Depends on patch in issue 18.

Original issue reported on code.google.com by [email protected] on 17 Aug 2008 at 1:03

Attachments:

0001-added-S-for-skip-and-E-for-error-while-running.patch

fix pygr.sf.net, remove? annotate? old downloads

pygr.sf.net should redirect somewhere; right now it's an empty page. this
can be done either with a .htaccess file or with an http-equiv redirect
statement in index.html

Also, the old files etc on that site are, well, old and buggy.  Can we just
put an README or some such there so that people are directed to the latest
version?  IS there a latest tarball that's automatically generated nightly
from the stable branch of pygr?

Original issue reported on code.google.com by [email protected] on 27 May 2008 at 4:44

TBLASTN parsing error

Bug in BLAST parser - from a git checkout on September 15th 2008                


Background:

In order to get around a bug in tblastn the start position of the sequence
for the subject line is set to the start position of the query above it.        



However if the first base of the query line is Q, then the find matches the
Q of Query: and it stores 0 as the offset of the start of the subject
sequence. See the last line of the result below plus traceback:                 


>ref|NC_007503.1| Carboxydothermus hydrogenoformans Z-2901, complete genome
          Length = 2401520                                                 

 Score = 90.5 bits (223), Expect = 4e-17,   Method: Compositional matrix
adjust.
 Identities = 53/181 (29%), Positives = 97/181 (53%), Gaps = 12/181 (6%)
 Frame = +2

Query: 13    GKVLWQNLTFTISAGERVGIHAPSGTGKTTLGRVLAGWQKPTAGDVLLDGSPFPLHQYCP
72
             G+V+   +TFT+  G+ +G+  PSG GK++L R+L     PT+G++   G    + +Y P
Sbjct: 99509 GQVILDGITFTVEEGDFLGVLGPSGAGKSSLFRLLNRLLSPTSGEIYYRGK--NIKEYDP
99682

Query: 73    VQLVPQHPELTFNPWRSAGDAVRD--------AWQPDPETLRRL----HVQPEWLTRRPM
120
             ++L  +   +   P+      + D          +PD E + +     +++ E L ++P
Sbjct: 99683 IKLRREIGYVLQRPYLFGQKVLEDLTYPFRIRQEKPDMELIYKYLAQANLKEEILAKKPT
99862

Query: 121   QLSGGELARIAILRALDPRTRFLIADEMTAQLDPSIQKAIWVYVLEVCRSRSLGMLVISH
180
             +LSGGE  RI+++R L  + R L+ DE+T+ LD    +AI   +L+    ++L +L I+H
Sbjct: 99863 ELSGGEAQRISLIRTLLVQPRVLLLDEVTSALDLDTTRAILDLILKEKEEKNLTVLAITH
100042

Query: 181   Q 181

Sbjct: 100043N 100045

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/pygr/parse_blast.py", line 174, in
<module>
    for t in p.parse_file(sys.stdin):
  File "/usr/lib/python2.5/site-packages/pygr/parse_blast.py", line 169, in
parse_file
    self.save_subject_line(line)
  File "/usr/lib/python2.5/site-packages/pygr/parse_blast.py", line 80, in
save_subject_line
    self.subject_end=int(c[3])
ValueError: invalid literal for int() with base 10: '100043N'


Possible Bugfix/workaround:

Line 70 in parse_blast.py

currently:
        self.seq_start_char=line.find(c[2]) # IN CASE BLAST SCREWS UP
Sbjct:

could be:
        self.seq_start_char=line[1:].find(c[2])+1 # IN CASE BLAST SCREWS UP
Sbjct: - only search from second character to avoid matches against Q of
Query:

Original issue reported on code.google.com by [email protected] on 22 Sep 2008 at 11:48

refactor setup.py, add distutils & better import test

This wholesale editing of setup.py adds:

 - PEP-8 compliant spacing, commenting, etc.

 - attempt to use setuptools (easy_install) rather than distutils.  setuptools
   is backwards compatible with distutils but supports a bunch of new options,
   including egg generation.

 - switch to imp.find_module to directly attempt import of pyx extension
   modules, rather than path mangling + 'exec import', which is more fragile
   and error prone.

 - direct use of os.path.sep instead of guessing based on os.path.join output

Original issue reported on code.google.com by [email protected] on 21 Jun 2008 at 8:58

Attachments:

0006-refactoring-of-setup.py-to-add-distutils-better-imp.patch

Establish a windows buildbot

Create a Windows buildbot so that we can run tests of the latest code on
Windows.

Original issue reported on code.google.com by [email protected] on 27 May 2008 at 4:40

Make download-progress messages in downloader.py more terse

Right now, if one tries to download genome assemblies from UCSC, urllib 
module prints out too often progress messages. Thus, it is not possible to 
see any other mmessages. We have two choices.

(1) prints out dots per 0.1% not per packets from urllib
(2) prints out information lines per 1% progress

Anyway, we need to get file size information prior to downloading any 
files.

Original issue reported on code.google.com by [email protected] on 22 May 2008 at 6:58

Using pygr in Windows (XP or VISTA)

(2) path joining problem: all path joining by '/' should be replaced with 
os.path.join function.

(3) my.cnf and my.ini: Linux uses my.cnf but windows uses my.ini for 
default MySQL connection (actually not default because we have to give 
that path in MySQLdb.connect). But, MySQLdb module may not work correctly 
because it does not recognize spaces in Windows Path! You may need to save 
the my.ini in C:\ or other directory without spaces in Path. e.g. Don't 
save that file in C:\Document and Settings\myaccount\ or something.

(4) Port opening: It is easy to open port in linux, but windows may think 
that is an illegal attempt for hacking. Thus, windows does not allow any 
port opening. You may have to do something in your firewall settings. I 
tried several ways but every time I did, cmd shell just crashed.

(5) 32-bit, 64-bit: we may need to build several binaries for each 
platforms.

(6) FAT32 vs. NTFS: FAT32 does not support large files. We have to use 
NTFS for building/accessing NLMSAs if it is large. Oh! my USB drives... it 
uses FAT32...

(7) Windows memory: Usually, windows machines do not have larger memory 
than linux. We have to reduce memory requirement if you want to build 
NLMSAs. There are two parameters you need to change. Check forum site for 
more details. Remember that windows does not control user priority like 
linux, 1-99, it just has 6 levels (Windows VISTA) or 3 levels (Windows XP 
or earler). Windows may use up all your CPU resources (like frozen) when 
you are building NLMSAs.

(8) Python versions: There would be no universal binaries (.pyd) for 
windows. We may have to make several windows binary distribution files 
depending on python versions. Now, we have twice many combinations due to 
32-bit and 64-bit.

(9) Windows version. It looks like windows library changed. XP vs. VISTA.

Original issue reported on code.google.com by [email protected] on 13 May 2008 at 11:36

write docs on running & managing a XMLRPC pygr.Data server

e.g.
 * running it behind screen so you can reconnect
 * administering it via remote XMLRPC commands
 * security issues

Original issue reported on code.google.com by [email protected] on 11 Sep 2008 at 4:27

create the MySQL test database if it does not exist, rather than skipping

nolleyal's patch to create the MySQL test database if it does n
ot exist.

This changes the behavior so that the test tries to create the database
rather than skipping the tests if the database does not already exist.
The database is dropped after the tests are run.

The test database name was also changed from 'test' to '_pygrtestdb' on
the grounds that the latter is less likely to conflict with existing
databases.

Original issue reported on code.google.com by [email protected] on 21 Jun 2008 at 8:54

Attachments:

0003-nolleyal-s-patch-to-create-the-MySQL-test-database-i.patch

pygr Pyrex compilation on Windows fails

To reproduce problem:
1. Run 'python setup.py build' in pygr directory

The error message:
pygr\cdict.c: In function `initcdict':
pygr\cdict.c:4516: error: `calloc_int' undeclared (first use in this function)
pygr\cdict.c:4516: error: (Each undeclared identifier is reported only once
pygr\cdict.c:4516: error: for each function it appears in.)
error: command 'gcc' failed with exit status 1

This happens with pygr version 0.7.1 and while running on Windows XP SP2.
The latest versions of Pyrex (0.9.8.4) and MinGW were used (5.1.4). The
earlier version of pygr (0.7) compiles successfully.

Original issue reported on code.google.com by [email protected] on 24 Jun 2008 at 8:04

SQL "view" mapping and graph transformations

allow the user to specify any SQL query or join to define a "view" that
performs a 1:1 or many:many mapping.  Suggested by Jenny Qian.

Original issue reported on code.google.com by [email protected] on 11 Dec 2008 at 5:00

amuralle / pygr Goto Github PK

pygr's People

Contributors

pygr's Issues

Recommend Projects

Recommend Topics

Recommend Org