pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Home Page: https://pandas.pydata.org

License: BSD 3-Clause "New" or "Revised" License

Python 90.24% Shell 0.32% HTML 2.02% C 1.56% Smarty 0.04% CSS 0.03% Dockerfile 0.03% XSLT 0.01% Cython 5.71% Meson 0.05%

data-analysis pandas flexible alignment python data-science


pandas's Issues

np.fix doesn't work

In [28]: np.fix(Series([1,2,3], range(3)))
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)

C:\alei1\basic_mktneutral\<ipython console> in <module>()

C:\Python26\lib\site-packages\numpy\lib\ufunclike.pyc in fix(x, y)
     43     x = nx.asanyarray(x)
     44     if y is None:
---> 45         y = nx.zeros_like(x)
     46     y1 = nx.floor(x)
     47     y2 = nx.ceil(x)

C:\Python26\lib\site-packages\numpy\core\numeric.pyc in zeros_like(a)
     92     if isinstance(a, ndarray):
     93         res = ndarray.__new__(type(a), a.shape, a.dtype, order=a.flags.fnc)
---> 94         res.fill(0)
     95         return res
     96     try:

C:\Python26\lib\site-packages\pandas\core\series.pyc in fill(self, value, method)
    822         """
    823         if value is not None:
--> 824             newSeries = self.copy()
    825             newSeries[isnull(newSeries)] = value
    826             return newSeries

C:\Python26\lib\site-packages\pandas\core\series.pyc in copy(self)
    342
    343     def copy(self):
--> 344         return Series(self.values.copy(), index=self.index)
    345
    346 #-------------------------------------------------------------------------------


C:\Python26\lib\site-packages\pandas\core\series.pyc in __new__(cls, data, index, dtype, copy)
    136
    137         if index is None:
--> 138             raise Exception('Index cannot be None!')
    139
    140         # This is to prevent mixed-type Series getting all casted to

I think the bigger problem is that Series overrides the default behavior of ndarray's fill() method, which may cause other numpy / scipy functions to misbehave as well.

Does pandas support piecewise/multi-period regression?

Hi,

I need to do piecewise or multi-period regression where the breakpoint is determined automatically by the algorithm itself. Does pandas support that? Thanks a lot.
-alex
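For illustration only, here is a minimal pure-Python sketch of one common way to pick a single breakpoint automatically: try every admissible split, fit a straight line to each side by ordinary least squares, and keep the split with the smallest total squared error. It is not a pandas feature; `fit_line`, `best_breakpoint`, and the sample data are hypothetical names and values made up for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x; returns (slope, intercept, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def best_breakpoint(xs, ys, min_size=2):
    """Brute-force search for the split index that minimizes the summed squared error."""
    best = None
    for split in range(min_size, len(xs) - min_size + 1):
        left = fit_line(xs[:split], ys[:split])
        right = fit_line(xs[split:], ys[split:])
        total = left[2] + right[2]
        if best is None or total < best[0]:
            best = (total, split, left[0], right[0])
    return best


# Hypothetical data: slope 1 for x < 5, slope 3 from x = 5 onward.
xs = list(range(10))
ys = [0, 1, 2, 3, 4, 8, 11, 14, 17, 20]
total_sse, split, left_slope, right_slope = best_breakpoint(xs, ys)
```

The search is quadratic in the number of observations and handles only one breakpoint, so it is a starting point rather than a substitute for a dedicated structural-break method.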

Make more flexible arithmetic functions in DataFrame/DataMatrix

For example:

df.add(series, axis=0)
df.add(series, axis='index')
df.add(series, axis='columns')

Etc., might be useful in certain cases
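As a point of reference, this is roughly how such a calling convention looks in present-day pandas, where `DataFrame.add` takes an `axis` argument. The frame and series below are made-up sample data, and the sketch assumes a current pandas release rather than the DataFrame/DataMatrix classes discussed in this issue.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)
row_offsets = pd.Series([100.0, 200.0, 300.0], index=["a", "b", "c"])
col_offsets = pd.Series([0.5, 0.25], index=["x", "y"])

# axis=0 (or "index"): match the Series labels against the row labels.
by_rows = df.add(row_offsets, axis="index")

# axis=1 (or "columns"): match the Series labels against the column labels.
by_cols = df.add(col_offsets, axis="columns")
```

Accepting both the integer and the string spelling lets callers write whichever reads more clearly at the call site.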

Better support for mixed-type data

Across DataFrame and Wide/LongPanel classes

Fix argument inconsistency in pandas.stats.moments exp-weighted functions

Should have min_periods arguments like the other rolling moment functions
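To make the requested behavior concrete, the sketch below uses the rolling and exponentially weighted interfaces of a current pandas release (an assumption; the pandas.stats.moments functions named in this issue are not used here). In both cases `min_periods=2` keeps the first result missing because only one observation is available.

```python
import pandas as pd

# Hypothetical sample series.
prices = pd.Series([1.0, 2.0, 3.0, 4.0])

# Rolling mean: needs at least two observations inside the window.
rolling_mean = prices.rolling(window=3, min_periods=2).mean()

# Exponentially weighted mean with the same min_periods convention.
ew_mean = prices.ewm(span=3, min_periods=2).mean()
```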

Old bug opened on google code showed that workaround for the pypi install failures should be to download latest source and compile as such:
python setup.py build --compiler=mingw32
python setup.py install

tseries.c has two errors stopping compilation

pandas\lib\src\tseries.c: In function '__Pyx_c_absf':
pandas\lib\src\tseries.c:18906:25: error: #if with no expression
pandas\lib\src\tseries.c: In function '__Pyx_c_abs':
pandas\lib\src\tseries.c:19026:25: error: #if with no expression
pandas\lib\src\tseries.c: At top level:

should that be #ifdef?

Importing data using HDFStore with pre-epoch dates; "ValueError: timestamp out of range for platform localtime()/gmtime() function"

I have a DataFrame with data that goes back to 1949. I imported it from a CSV file into an HDF5 file using HDFStore. That went fine, but when reading from the HDFStore to get a DataFrame back, I get the stack trace below. When looking at the data in the store, I see that the index has negative values for pre-epoch times...

ValueError: timestamp out of range for platform localtime()/gmtime() function
File "C:\dev\MktDB\test_continuation.py", line 59, in <module>
main()
File "C:\Python27\lib\site-packages\pandas-0.3.0-py2.7-win32.egg\pandas\io\pytables.py", line 157, in _read_group
File "C:\Python27\lib\site-packages\pandas-0.3.0-py2.7-win32.egg\pandas\io\pytables.py", line 173, in _read_frame
File "C:\Python27\lib\site-packages\pandas-0.3.0-py2.7-win32.egg\pandas\io\pytables.py", line 210, in _read_index
File "C:\Python27\lib\site-packages\pandas-0.3.0-py2.7-win32.egg\pandas\io\pytables.py", line 227, in _unconvert_index
File "C:\Users\Shon\AppData\Roaming\Python-Eggs\pandas-0.3.0-py2.7-win32.egg-tmp\pandas\lib\tseries.pyd", line 45, in tseries.array_to_datetime (pandas\lib\src\tseries.c:14378)
File "C:\Users\Shon\AppData\Roaming\Python-Eggs\pandas-0.3.0-py2.7-win32.egg-tmp\pandas\lib\tseries.pyd", line 20, in tseries.to_datetime (pandas\lib\src\tseries.c:13910)

Any guidance would be most appreciated.
Shon
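One platform-independent way to handle negative (pre-epoch) timestamps is to avoid the C library's localtime()/gmtime() entirely and do the arithmetic with `timedelta`. The sketch below is only an illustration of that idea, not the conversion pandas.io.pytables performs; the helper names are hypothetical, and it assumes the stored values are seconds since 1970-01-01 interpreted as naive UTC.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def datetime_to_seconds(stamp):
    """Seconds since the Unix epoch; negative for dates before 1970."""
    return (stamp - EPOCH).total_seconds()


def seconds_to_datetime(seconds):
    """Inverse conversion using timedelta arithmetic instead of platform time functions."""
    return EPOCH + timedelta(seconds=seconds)


stored = datetime_to_seconds(datetime(1949, 1, 3))
restored = seconds_to_datetime(stored)
```

Because no platform time function is called, the round trip behaves the same on Windows and Linux.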

resource hog -> .asfreq(DateOffset(seconds=1),method='pad')

>>> df2['a']
2010-12-01 00:00:00    1
2010-12-02 00:00:00    2
2010-12-03 00:00:00    3
2010-12-04 00:00:00    4
>>> df2['a'].asfreq(pn.DateOffset(seconds=1),method='pad')

## NOTE: The above one-liner runs for over 20 minutes on a 2Ghz Xeon, python 2.5.2, numpy 1.5.1,
##       & pandas head from 2011-01-05.  It also consumed > 15% (500MB!!!) of available DRAM
##       before I manually killed it
[mpenning@Bucksnort tickdata]$ free
             total       used       free     shared    buffers     cached
Mem:       3894944    3771540     123404          0     192340    2328984
-/+ buffers/cache:    1250216    2644728
Swap:      2830328        596    2829732
[mpenning@Bucksnort tickdata]$ cat /proc/cpuinfo
...
processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.00GHz
stepping        : 7
cpu MHz         : 1995.840
cache size      : 512 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 7
initial apicid  : 7
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips        : 3991.51
clflush size    : 64
power management:

[mpenning@Bucksnort tickdata]$
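For scale, the target index in the report above is small: three days at one-second resolution. The sketch below counts the points and forward-fills them in a single linear pass over plain Python lists. It is a hypothetical stand-in for what `asfreq(..., method='pad')` has to produce, not pandas' implementation, and `pad_to_frequency` is an invented helper name.

```python
from datetime import datetime, timedelta


def pad_to_frequency(stamps, values, step):
    """Forward-fill sorted (stamps, values) onto a regular grid in one linear pass."""
    padded = []
    source = 0
    current = stamps[0]
    while current <= stamps[-1]:
        # Advance to the last observation at or before the current grid point.
        while source + 1 < len(stamps) and stamps[source + 1] <= current:
            source += 1
        padded.append((current, values[source]))
        current += step
    return padded


stamps = [datetime(2010, 12, day) for day in (1, 2, 3, 4)]
values = [1, 2, 3, 4]
step = timedelta(seconds=1)

# Number of grid points between the first and last stamp, inclusive.
n_points = (stamps[-1] - stamps[0]) // step + 1
padded = pad_to_frequency(stamps, values, step)
```

A quarter of a million points is well within reach of a plain loop, which suggests the 20-minute runtime comes from how the new index is generated or aligned rather than from the size of the result.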

Slice DataMatrix-- get a view

add copy keyword to .xs() method. Won't work with DataFrame, should raise Exception

Fix pandas.io.parsers.parseExcel to be more robust / have correct handling of Excel dates

Improve fancy indexing when selecting cross-sections

Something like

df.ix[0, ['C', 'B', 'A']]

is currently performing a wasteful reindex(columns=['C', 'B', 'A']); we need a better way to infer the "right" order in which to perform the operations.

.asfreq() dies when the index values have tzinfo associated with them

>>> # dataframe is indexed at a second granularity, using eastern time
>>> dataframe
<class 'pandas.core.frame.DataFrame'>
Index: 24676 entries , 2010-10-04 00:03:49-04:00 to 2010-10-04 23:59:55-04:00
etf               10897  non-null values
etfvol            10897  non-null values
fut               17988  non-null values
futvol            17988  non-null values
ticx              7880  non-null values
vix               7465  non-null values
vixvol            7465  non-null values

>>> dataframe['etf'].index[0]
datetime.datetime(2010, 10, 4, 0, 3, 49, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>)
>>>
>>> # et is a pytz timezone object for eastern time...
>>> et
<DstTzInfo 'US/Eastern' EST-1 day, 19:00:00 STD>

>>> dataframe['fut'].asfreq(et.localize(pn.DateOffset(seconds=1)), method='pad')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/pytz-2010o-py2.5.egg/pytz/tzinfo.py", line 262, in localize
    if dt.tzinfo is not None:
AttributeError: 'DateOffset' object has no attribute 'tzinfo'
>>>
>>> dataframe['fut'].asfreq(pn.DateOffset(seconds=1), method='pad')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/pandas/core/series.py", line 1103, in asfreq
    dateRange = DateRange(self.index[0], self.index[-1], offset=freq)
  File "/usr/lib/python2.5/site-packages/pandas/core/daterange.py", line 70, in __new__
    fromInside = start is not None and start > _CACHE_START
TypeError: can't compare offset-naive and offset-aware datetimes
>>>
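The second traceback comes down to a rule of the standard library: ordering comparisons between a timezone-aware and a naive datetime raise TypeError. The sketch below reproduces that with the stdlib only; the fixed -04:00 offset is a simplified stand-in for the pytz US/Eastern zone, and `naive_bound` is a hypothetical value playing the role of the naive `_CACHE_START` seen in the traceback.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for US/Eastern during daylight saving time.
eastern_dst = timezone(timedelta(hours=-4))

aware = datetime(2010, 10, 4, 0, 3, 49, tzinfo=eastern_dst)
naive_bound = datetime(2000, 1, 1)  # hypothetical naive cache boundary

try:
    aware > naive_bound
    comparison_failed = False
except TypeError:
    # "can't compare offset-naive and offset-aware datetimes"
    comparison_failed = True

# Attaching a tzinfo to the naive side makes the two values comparable.
comparable = aware > naive_bound.replace(tzinfo=eastern_dst)
```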

Is there a single pdf instead of html documentation/help file?

This may not be the best place to post this because it is a question rather than an "issue", but I don't know where else to post it. Is there a single PDF help or documentation file, instead of the HTML help pages on the sourceforge website? Also, is there a FAQ or forum, other than this one, where people discuss usage, tutorials, success stories, etc. for pandas? Maybe pystatsmodels? Thanks again, Wes, for creating this package! -alex

Enable element-wise comparison operations in DataMatrix objects

re: pystatsmodels e-mail

hi everyone,

just getting started with pandas and i was wondering if someone could
help me out. do pandas.DataMatrix objects support per item comparison
operations?

i have a two data matrices, and i want to do something like this:

div[div > 0.5 * price] = 0

this would work if div and price were numpy.ndarray objects. any idea
how i would do something like this with pandas.DataMatrix objects?

thanks,
andy
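For comparison, the same expression works as written on DataFrame objects in a current pandas release (an assumption; DataMatrix itself is not used here). The two small frames are made-up sample data.

```python
import pandas as pd

# Hypothetical sample data with matching labels.
price = pd.DataFrame({"AAA": [10.0, 20.0], "BBB": [30.0, 40.0]})
div = pd.DataFrame({"AAA": [6.0, 1.0], "BBB": [2.0, 25.0]})

# Element-wise comparison yields a boolean frame with the same labels.
mask = div > 0.5 * price

# Boolean-frame assignment zeroes only the flagged cells.
div[mask] = 0
```

The comparison aligns on row and column labels first, so the two frames do not need to be stored in the same order.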

Join methods for WidePanel

e.g.

def create_panels_join(a, b):
    """Return a join of the two panels."""
    # Join each item of b onto the matching item of a, working on the transposes.
    d = dict([(i, (a[i].T.join(v.T)).T) for i, v in b.iteritems()])
    return p.WidePanel.fromDict(d)

This is a 'more' correct version, but one has to ensure that a[i] is not None:

def create_panels_join(a, b):
    """Return a join of the two panels."""
    # Union of the item labels; sets are combined with |, not +.
    items = set(a.items) | set(b.items)
    return WidePanel.fromDict(dict([(i, (a[i].T.join(b[i].T)).T) for i in items]))

Optimized DataFrame.append for block structure if possible

Adding variable description in DataFrame

Hi,

I think it would be great if DataFrame could store an additional property (a list of strings) containing a description of each variable (column). The descriptions could then be shown alongside the variable names, as well as in the output of pandas.DataFrame.info(). With this, a DataFrame would be pretty much self-contained. Right now I need to keep an additional object or a text file that holds those descriptions.

Please let me know what you think!

-Joon

Incorrect T-Stats in FamaMacBeth.py

Small bug in the FamaMacBeth class at lines 105-106: the printed form of the regression results gives the STD twice, rather than the t-stat:

std_beta = self._results['std_beta'][i]
t_stat = self._results['std_beta'][i]

The second lookup should use the 't_stat' key instead of 'std_beta'.
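To show why the two numbers should differ, here is a small worked example of a Fama-MacBeth style t-statistic: the mean of the per-period coefficients divided by the standard error of that mean. The coefficient values are invented, and whether the `std_beta` stored by pandas already includes the division by the square root of the number of periods is an assumption of this sketch.

```python
import math
import statistics

# Hypothetical per-period slope estimates from the cross-sectional regressions.
betas = [0.8, 1.1, 0.9, 1.2, 1.0]

mean_beta = statistics.mean(betas)

# Standard error of the mean: sample standard deviation over sqrt(number of periods).
std_beta = statistics.stdev(betas) / math.sqrt(len(betas))

# The t-statistic is the ratio of the two, not a second copy of std_beta.
t_stat = mean_beta / std_beta
```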

numpy traceback during easy_install on Debian Lenny

I'm not sure whether this is due to a problem with my local version of numpy or something else (in perhaps pandas?)... I did not see this previously (same numpy install and other dependencies) when I installed from Github trunk (sometime around November 15, 2010)...

[mpenning@Bucksnort ~]$ uname -a
Linux Bucksnort 2.6.26-2-686 #1 SMP Thu Sep 16 19:35:51 UTC 2010 i686 GNU/Linux
[mpenning@Bucksnort ~]$ cat /etc/issue
Debian GNU/Linux 5.0 \n \l

[mpenning@Bucksnort ~]$ sudo easy_install -U pandas
[sudo] password for mpenning: 
Searching for pandas
Reading http://pypi.python.org/simple/pandas/
Reading http://pandas.sourceforge.net
Best match: pandas 0.3.0.beta
Downloading http://pypi.python.org/packages/source/p/pandas/pandas-0.3.0.beta.tar.gz#md5=18a39d6aa5df2f3515bada968554f049
Processing pandas-0.3.0.beta.tar.gz
Running pandas-0.3.0.beta/setup.py -q bdist_egg --dist-dir /tmp/easy_install-24C3ig/pandas-0.3.0.beta/egg-dist-tmp-Lc7eTD
warning: no files found matching 'LICENSE.txt'
warning: no files found matching 'README.txt'
/usr/lib/python2.5/site-packages/numpy/core/include/numpy/__ufunc_api.h:197:  warning: '_import_umath' defined but not used
pandas/lib/src/tseries.c:1535: warning: '__pyx_f_7tseries_get_int16_ptr' defined but not used
pandas/lib/src/tseries.c:1572: warning: '__pyx_f_7tseries_get_int32_ptr' defined but not used
pandas/lib/src/tseries.c:1609: warning: '__pyx_f_7tseries_get_int64_ptr' defined but not used
pandas/lib/src/tseries.c:1646: warning: '__pyx_f_7tseries_get_double_ptr' defined but not used
zip_safe flag not set; analyzing archive contents...
Adding pandas 0.3.0.beta to easy-install.pth file

Installed /usr/lib/python2.5/site-packages/pandas-0.3.0.beta-py2.5-linux-i686.egg
Processing dependencies for pandas
Finished processing dependencies for pandas
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.5/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.5/site-packages/numpy/distutils/misc_util.py", line 251, in clean_up_temporary_directory
    from numpy.distutils import log
SystemError: Parent module 'numpy.distutils' not loaded
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.5/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.5/site-packages/numpy/distutils/misc_util.py", line 251, in clean_up_temporary_directory
    from numpy.distutils import log
SystemError: Parent module 'numpy.distutils' not loaded
[mpenning@Bucksnort ~]$ python
Python 2.5.2 (r252:60911, Jan 24 2010, 14:53:14) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> quit()
[mpenning@Bucksnort ~]$

Time zone handling in DateRange, etc. using pytz

Working on this at the moment, but still more to do

DataMatrix constructor ignores dtype argument if data is a DataMatrix

>>> a = DataMatrix([[1.0,2.0,3.0],[4.0,5.0,6.0]], range(2), range(3))
>>> b = DataMatrix(a, dtype=int)
>>> b.values.dtype
dtype('float64')

Not sure if this was a design decision or not, but it seems like the dtype of b.values should be int.

reindex_like function

In Series, DataMatrix, WidePanel, etc.

Improvements to pandas.io.pytables / unit testing

Need to incorporate the selection of ranges of data and write unit tests

Add PyTable Paths to HDFStore

Hello,
I'm not sure if this is the best way to suggest improvements, but here I go anyway :)

I really like the ease of use of HDFStore (and the entire project, for that matter), but I wanted the ability to store and retrieve DataFrames in Groups below the root, i.e.

h5 = HDFStore('test.h5')
h5['/groups/below/theroot'] = DataFrame(data, index)

df =  h5['/groups/below/theroot']

I made the following changes to the HDFStore class to do this, and am now using it in production code. If you agree this is something useful, then I would like it to become a part of the main code base, using your own approach or the one below.

I changed the __repr__, __getitem__, and _write_group functions in pandas.io.pytables.

def __repr__(self):
    output = str(self.__class__) + '\n'

    # Extract the path and kind of every PyTables Group that has a 'pandas_type' attribute.
    keys, values = zip(*((x._v_pathname, x._v_attrs.pandas_type) for x in self.handle.walkGroups() if hasattr(x._v_attrs,'pandas_type')))

    output += adjoin(5, keys, values)
    return output



def __getitem__(self, key):

    if not key[0] == '/': # Add the root slash so we can use getNode below
        key = '/' + key

    group = self.handle.getNode(key)
    return _read_group(group)



def _write_group(self, key, value):
    root = self.handle.root


    if key[0] == '/': #Assume they want a nested pytable Group
        final_slash = key.rfind('/')
        where = key[:final_slash]
        name = key[final_slash + 1:]
    else:
        where = '/'
        name = key

    try:
        group = self.handle.getNode(key)
    except:
        group = self.handle.createGroup(where, name, createparents=True)


    kind = type(value)
    handler = self._get_write_handler(kind)

    try:
        handler(group, value)
    except Exception:
        raise

    group._v_attrs.pandas_type = kind.__name__
    return True

Please let me know if there are any questions or concerns... I made it so that it still works just like the original if the user doesn't need groups...

Thanks for a great project.
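The key handling in `_write_group` above can be isolated into a small pure-Python helper, sketched here without PyTables. `split_key` is a hypothetical name, not part of pandas. Unlike the slicing in the proposal, which appears to yield an empty parent path for a top-level key written with a leading slash (such as '/frame'), this version falls back to the root.

```python
def split_key(key):
    """Split a store key into (parent group path, node name)."""
    if not key.startswith("/"):
        key = "/" + key  # keys without a slash live directly under the root
    where, _, name = key.rpartition("/")
    return (where or "/", name)


nested = split_key("/groups/below/theroot")
top_level = split_key("frame")
top_level_with_slash = split_key("/frame")
```

The returned pair corresponds to the `where` and `name` arguments that the proposal passes to `createGroup`.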

Consistent column ordering in mixed-type DataMatrix

As floats and non-floats are currently segregated, column ordering is not reliable. Thus, the behavior of DataFrame and DataMatrix is not identical.

Rename {Series, DataFrame, WidePanel}.fill() to something else

Will need to give users advance warning of API breakage

Install problems

pypi install of pandas still isn't working on 2.7, even with all dependencies

Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas\__init__.py", line 12, in <module>
    from pandas.core.api import *
  File "pandas\core\api.py", line 8, in <module>
    from pandas.core.common import isnull, notnull
  File "pandas\core\common.py", line 6, in <module>
    import pandas.lib.tseries as tseries
ImportError: No module named tseries

parseCSV on result of toCSV does not preserve index type

From Google Code: http://code.google.com/p/pandas/issues/detail?id=14

First of all, thanks very much for this very nice project. I find it extremely useful.

I ran into a very minor issue:

What steps will reproduce the problem?

Create a DataFrame with date objects in the index
Write to csv using toCSV method
Read DataFrame using parseCSV method

What is the expected output? What do you see instead?
I had hoped that the index of the parsed DataFrame would
consist of date objects, instead they are strings.

What version of the product are you using? On what operating system?
Trunk, revision 202. Ubuntu 10.10

Please provide any additional information below.
The reason is that toCSV writes the first column header as "index"
If it were to leave it blank, things work as I expect.
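For readers on a current pandas release, the round trip can be sketched with `to_csv` and `read_csv` (an assumption; the toCSV and parseCSV functions from this report are not used). The frame is made-up sample data, and datetime objects stand in for the date objects of the report. Without `parse_dates=True` the index comes back as text, which mirrors the behavior described above.

```python
import io
from datetime import datetime

import pandas as pd

# Hypothetical frame with a date-valued index.
frame = pd.DataFrame(
    {"value": [1.5, 2.5]},
    index=[datetime(2010, 1, 4), datetime(2010, 1, 5)],
)

buffer = io.StringIO()
frame.to_csv(buffer)

# Default parsing leaves the index values as text.
buffer.seek(0)
as_text = pd.read_csv(buffer, index_col=0)

# Asking for date parsing restores a datetime index.
buffer.seek(0)
as_dates = pd.read_csv(buffer, index_col=0, parse_dates=True)
```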

examples not running, error in documentation

Both examples do not run. In finance.py, on line 83:

 filledRatio = ibm / ibmMonthly.reindex(ibm.index, fillMethod='pad')

should be:

 filledRatio = ibm / ibmMonthly.reindex(ibm.index, method='pad')

As reindex is defined in core/series.py as

 def reindex(self, index=None, method=None):

This error is repeated in the documentation at http://pandas.sourceforge.net (reindex uses the argument fillMethod instead of method).

running regressions.py, I get the following error:

Traceback (most recent call last):
  File "regressions.py", line 30, in <module>
    model = ols(y=Y, x=X)
  File "/usr/local/lib/python2.6/dist-packages/pandas/stats/interface.py", line 117, in ols
    return klass(**kwargs)
  File "/usr/local/lib/python2.6/dist-packages/pandas/stats/ols.py", line 56, in __init__
    self.sm_ols = sm.OLS(self._y_raw, self._x.values).fit()
AttributeError: 'module' object has no attribute 'OLS'

It's possible I'm somehow combining incompatible versions of pandas, but I don't see how, considering I downloaded pandas today, once. The first bug, at least seems like a simple code error.

Hopefully this is of some help,
Colum

Install 0.2beta fail on Debian Lenny / python2.5

Using a stock debian lenny box with:

numpy-1.3.0 built from source
scipy 0.8.0 built from source
matplotlib-1.0.0 built from source
pandas-0.2beta (note: no problems with latest github pull)

Errors: Install 0.2beta

mpenning@Bucksnort:~/src/pandas-0.2beta$ sudo python setup.py install
running install
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_src
building extension "pandas.lib.tseries" sources
pandas.lib - nothing done with h_files = ['pandas/lib/include/wirth.h']
building data_files sources
running build_py
creating build
creating build/lib.linux-i686-2.5
creating build/lib.linux-i686-2.5/pandas
copying pandas/__init__.py -> build/lib.linux-i686-2.5/pandas
copying pandas/info.py -> build/lib.linux-i686-2.5/pandas
copying pandas/version.py -> build/lib.linux-i686-2.5/pandas
copying pandas/setup.py -> build/lib.linux-i686-2.5/pandas
creating build/lib.linux-i686-2.5/pandas/core
copying pandas/core/mixins.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/__init__.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/series.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/collection.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/index.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/panel.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/pytools.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/matrix.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/daterange.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/frame.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/common.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/datetools.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/groupby.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/setup.py -> build/lib.linux-i686-2.5/pandas/core
copying pandas/core/api.py -> build/lib.linux-i686-2.5/pandas/core
creating build/lib.linux-i686-2.5/pandas/io
copying pandas/io/__init__.py -> build/lib.linux-i686-2.5/pandas/io
copying pandas/io/parsers.py -> build/lib.linux-i686-2.5/pandas/io
creating build/lib.linux-i686-2.5/pandas/lib
copying pandas/lib/__init__.py -> build/lib.linux-i686-2.5/pandas/lib
copying pandas/lib/build.py -> build/lib.linux-i686-2.5/pandas/lib
copying pandas/lib/bench.py -> build/lib.linux-i686-2.5/pandas/lib
copying pandas/lib/setup.py -> build/lib.linux-i686-2.5/pandas/lib
creating build/lib.linux-i686-2.5/pandas/sandbox
copying pandas/sandbox/__init__.py -> build/lib.linux-i686-2.5/pandas/sandbox
creating build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/plm.py -> build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/__init__.py -> build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/ols.py -> build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/math.py -> build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/interface.py -> build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/common.py -> build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/fama_macbeth.py -> build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/setup.py -> build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/var.py -> build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/moments.py -> build/lib.linux-i686-2.5/pandas/stats
copying pandas/stats/api.py -> build/lib.linux-i686-2.5/pandas/stats
creating build/lib.linux-i686-2.5/pandas/util
copying pandas/util/__init__.py -> build/lib.linux-i686-2.5/pandas/util
copying pandas/util/decorators.py -> build/lib.linux-i686-2.5/pandas/util
copying pandas/util/testing.py -> build/lib.linux-i686-2.5/pandas/util
running build_ext
customize UnixCCompiler
customize UnixCCompiler using build_ext
building 'pandas.lib.tseries' extension
compiling C sources
C compiler: gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC

creating build/temp.linux-i686-2.5
creating build/temp.linux-i686-2.5/pandas
creating build/temp.linux-i686-2.5/pandas/lib
creating build/temp.linux-i686-2.5/pandas/lib/src
compile options: '-I/usr/lib/python2.5/site-packages/numpy/core/include -Ipandas/lib/include/ -I/usr/lib/python2.5/site-packages/numpy/core/include -I/usr/include/python2.5 -c'
gcc: pandas/lib/src/wirth.c
gcc: pandas/lib/src/tseries.c
/usr/lib/python2.5/site-packages/numpy/core/include/numpy/__ufunc_api.h:183: warning: '_import_umath' defined but not used
pandas/lib/src/tseries.c:1397: warning: '__pyx_f_7tseries_get_int16_ptr' defined but not used
pandas/lib/src/tseries.c:1434: warning: '__pyx_f_7tseries_get_int32_ptr' defined but not used
pandas/lib/src/tseries.c:1471: warning: '__pyx_f_7tseries_get_int64_ptr' defined but not used
pandas/lib/src/tseries.c:1508: warning: '__pyx_f_7tseries_get_double_ptr' defined but not used
gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions build/temp.linux-i686-2.5/pandas/lib/src/tseries.o build/temp.linux-i686-2.5/pandas/lib/src/wirth.o -o build/lib.linux-i686-2.5/pandas/lib/tseries.so
running scons
running install_lib
copying build/lib.linux-i686-2.5/pandas/lib/tseries.so -> /usr/lib/python2.5/site-packages/pandas/lib
running install_data
running install_egg_info
Removing /usr/lib/python2.5/site-packages/pandas-0.2beta.egg-info
Writing /usr/lib/python2.5/site-packages/pandas-0.2beta.egg-info
mpenning@Bucksnort:~/src/pandas-0.2beta$ cd
mpenning@Bucksnort:~$
mpenning@Bucksnort:~$ python
Python 2.5.2 (r252:60911, Jan 24 2010, 14:53:14)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/pandas/__init__.py", line 14, in <module>
    from pandas.stats.api import *
  File "/usr/lib/python2.5/site-packages/pandas/stats/api.py", line 7, in <module>
    from pandas.stats.moments import *
  File "/usr/lib/python2.5/site-packages/pandas/stats/moments.py", line 365, in <module>
    _rolling_sum = _wrap_cython(tseries.roll_sum)
AttributeError: 'module' object has no attribute 'roll_sum'

Plot keyword arguments are unused in DataFrame plot()

Reported by twhitcomb, Jul 27, 2010
What steps will reproduce the problem?

Import pandas
Create a sample DataFrame
Plot the DataFrame using a linewidth keyword argument.

What is the expected output? What do you see instead?
I expect to see the linewidth keyword argument passed through to the plotting routine. Instead, the plot is displayed with the default linewidth (see attached figure).

What version of the product are you using? On what operating system?

pandas.version
0.20000000000000001
Microsoft Windows Vista, Python(x,y) 2.6.5.1

Please provide any additional information below.
Looking at the plot function in frame.py it's obvious why this is happening:
def plot(self, kind='line', **kwds): # pragma: no cover
    from pylab import plot

    for col in sorted(self.columns):
        s = self[col]
        plot(s.index, s, label=col)

Note that **kwds is not used in the plot command.

If I load a new function into my workspace like
def plot_frame(frame, **kwargs):
    from pylab import plot
    # Forward any extra keyword arguments (e.g. linewidth) to pylab's plot.
    for col in sorted(frame.columns):
        s = frame[col]
        plot(s.index, s, label=col, **kwargs)

Then the arguments are correctly passed on, and I get the proper response, as shown in the other attached figure.

Create generic moving window function

Probably no way to make more efficient than naive Python loop, but at least more convenient
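A naive version of such a function might look like the sketch below: a plain Python loop that applies an arbitrary callable to each trailing window. `rolling_apply` and its `min_periods` argument are hypothetical names chosen to mirror the other rolling functions, not an existing pandas API.

```python
def rolling_apply(values, window, func, min_periods=None):
    """Apply func to each trailing window; None where too few observations exist."""
    if min_periods is None:
        min_periods = window
    result = []
    for end in range(1, len(values) + 1):
        chunk = values[max(0, end - window):end]
        result.append(func(chunk) if len(chunk) >= min_periods else None)
    return result


rolling_max = rolling_apply([1, 2, 3, 4, 5], 3, max)
rolling_sum = rolling_apply([1, 2, 3, 4], 2, sum, min_periods=1)
```

Each window is sliced and evaluated from scratch, so the cost is proportional to the series length times the window size. That is the convenience-over-speed trade-off described above.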

All days instead of business days?

Hi,

I am very new to GitHub and miss a discussion board for pandas. I'm not sure if issues are the right place for my question, so feel free to delete it, but please point me to the right place for questions.

The question: I need to aggregate hourly time series to the daily scale, including all days, 7 per week. Currently I find only the bday class in datetools:

daily_ts = ts.groupby(lambda x: datetools.bday(x)).aggregate(np.nansum)

But bday will skip weekends, and datetools.day seems to return a time series with the same time resolution.
How to include weekend days?
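One way to keep every calendar day is to group on the date part of each timestamp instead of on a business-day offset. The sketch below assumes a current pandas release and uses an invented hourly series that spans a Friday, Saturday, and Sunday.

```python
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical hourly series: 72 hours starting Friday 2011-01-07.
stamps = [datetime(2011, 1, 7) + timedelta(hours=h) for h in range(72)]
ts = pd.Series(1.0, index=stamps)

# Group by calendar date so that weekend days get their own bucket.
daily = ts.groupby(lambda stamp: stamp.date()).sum()
```

The grouping function is called once per index label, so any rule that maps a timestamp to a bucket (a date, a week number, an hour of day) can be used the same way.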

Many thanks for developing pandas, it's truly nice, and really fast, even when working with large datasets!

DataFrame and DataMatrix column ordering

First, thank you for the pandas package -- it's incredibly useful and well done.

I know that one of the fundamental concepts behind the data structures is that column ordering doesn't matter. And, as long as one only uses pandas' data access/manipulation functions (e.g., sum(), ewma(), etc.), this works fine. But often, it's useful to access the underlying values in a numpy array for some more complicated data manipulation. Using the values attribute (or values() method for a series) does this, but it's not always obvious what order the values come back in.

For example:

In [1]: dm = DataMatrix(np.arange(2*3).reshape(2,3), index=[1,0], columns=['B', 'A', 'C' ])

In [2]: dm 
Out[2]: 
     B              A              C  
1    0              1              2  
0    3              4              5              

In [3]: df = DataFrame(np.arange(2*3).reshape(2,3), index=[1,0], columns=['B', 'A', 'C' ])

In [4]: df 
Out[4]: 
     A              B              C  
1    1              0              2  
0    4              3              5              

In [5]: df.values
Out[5]: 
array([[1, 0, 2],
       [4, 3, 5]])

In [6]: dm.values
Out[6]: 
array([[0, 1, 2],
       [3, 4, 5]])

DataMatrix seems to respect the passed-in ordering of columns, while DataFrame does not. I know this is documented, and not the biggest deal in the world, but it does seem to cause quite a bit of confusion for some. Is it possible to have both data types keep the ordering that's passed in? If a user passes in the same column name twice, could this just throw an exception? Something still needs to be done when an operation is performed on two DataFrames (e.g., combining them), but instead of reordering in alphabetical order, how about preserving the column ordering from left to right?

Anyway, my bigger concern is actually the following:

In [7]: dm.reindex(columns=['C','B','A']).values
Out[7]: 
array([[2, 0, 1],
       [5, 3, 4]])

In [8]: df.reindex(columns=['C','B','A']).values
Out[8]: 
array([[1, 0, 2],
       [4, 3, 5]])

Regardless of the ordering of the columns after creating a DataFrame/DataMatrix, a naive user (i.e., me) would expect that calling reindex and then values returns an ndarray with the columns in the requested order. But it looks like this only happens for DataMatrix objects (and I'm not even sure that's always guaranteed).
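For comparison, the expectation described here can be written down as a small check. The sketch assumes a current pandas release, where DataFrame keeps the column order it is given; it does not reproduce the DataFrame/DataMatrix classes from the session above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(2 * 3).reshape(2, 3),
    index=[1, 0],
    columns=["B", "A", "C"],
)

# Column order as passed to the constructor.
original = df.values.tolist()

# Column order as requested in reindex.
reordered = df.reindex(columns=["C", "B", "A"]).values.tolist()
```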

Configure pandas like pyzmq so checking in .c files is not necessary

No reason to be checking in the C files and diffs

Binary operations on int DataMatrix

Reported by kwgoodman, Jun 12, 2010
When adding two int DataMatrix objects, a ValueError is raised when matrix.py attempts to fill missing values with NaN:

dma1 = pandas.DataMatrix([[1, 2], [3, 4]], ['a', 'b'], ['c', 'd'])
dma2 = pandas.DataMatrix([[1, 2], [3, 4]], ['b', 'a'], ['d', 'c'])
dma1 + dma2

ValueError: cannot convert float NaN to integer

Comment 1 by kwgoodman, Jun 12, 2010
Possible fix:

x = np.array([1.0])
issubclass(x.dtype.type, np.inexact)
True
x = np.array([1])
issubclass(x.dtype.type, np.inexact)
False
x = np.array([1.0], dtype=object)
issubclass(x.dtype.type, np.inexact)
False
x = np.array([1.0], dtype=str)
issubclass(x.dtype.type, np.inexact)
False
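The root cause can be reproduced without NumPy: an integer type has no representation for NaN, so any result that may contain missing entries has to be promoted to float first. The sketch below is a plain-Python illustration under that assumption; add_aligned is a hypothetical helper, not the code in matrix.py.

```python
import math


def add_aligned(left, right):
    """Add two {label: int} mappings label by label.

    Results are always floats so that labels missing from either side
    can be filled with NaN (hypothetical helper, not pandas code).
    """
    result = {}
    for label in sorted(set(left) | set(right)):
        if label in left and label in right:
            result[label] = float(left[label] + right[label])
        else:
            result[label] = math.nan
    return result


# Converting NaN to an integer fails, which is the error quoted in the report.
try:
    int(math.nan)
except ValueError as exc:
    nan_error = str(exc)
    print(nan_error)

total = add_aligned({"a": 1, "b": 2}, {"b": 3, "c": 4})
print(total)
```

The dtype check in the comment above would serve the same purpose in the real code: if the data is not already an inexact (floating) type, convert it before filling with NaN.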

weights option may not be working in pandas.stats.ols.OLS

Need to investigate (user notified)

reindex(method="backfill") and periods within the last date

I have a DataFrame with 1 minute timestamps, pw.

I have a monthly average of the values:
ma_pw=pw.groupby(BMonthEnd(+1).rollforward).aggregate(np.mean)

When I do a reindex to push that average back into all of the periods for the month,
the last day's values are NaN. Here's my reindex attempt:

reindexed = ma_pw.reindex(pw.index, method="backfill")

I have worked around this by forcing the hour in my monthly data to something past the last minute that I care about in my by-minute data as follows, but this seems incredibly crude:

from dateutil.relativedelta import relativedelta
newi = [x + relativedelta(hour=20) for x in ma_pw.index]
ma_pw.index = newi[:]

Then my original reindex works.

Is there a clean way to instruct the reindex method to include the datetime objects having the same datetime.date on that last day of the month?
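A likely explanation, assuming the monthly labels fall at midnight on the last business day: backfill looks for the first label at or after each timestamp, and every intraday timestamp on that last day is later than a midnight label, so nothing matches. The sketch below models that lookup with the standard library only; backfill_lookup and the sample data are hypothetical and are not the pandas implementation.

```python
from bisect import bisect_left
from datetime import datetime


def backfill_lookup(labels, values, target):
    """Return the value at the first label >= target, or None if none exists."""
    pos = bisect_left(labels, target)
    return values[pos] if pos < len(labels) else None


# Hypothetical monthly averages labelled at midnight on the last business day.
month_labels = [datetime(2011, 1, 31), datetime(2011, 2, 28)]
month_means = [10.0, 20.0]

# A timestamp before the last day finds the month-end label.
mid_month = backfill_lookup(month_labels, month_means, datetime(2011, 2, 25, 9, 30))

# 09:30 on the last day is already past the midnight label, so nothing matches.
last_day = backfill_lookup(month_labels, month_means, datetime(2011, 2, 28, 9, 30))

# Moving the labels to 20:00, as in the workaround above, restores the match.
shifted_labels = [d.replace(hour=20) for d in month_labels]
last_day_shifted = backfill_lookup(shifted_labels, month_means, datetime(2011, 2, 28, 9, 30))

print(mid_month, last_day, last_day_shifted)
```

Under this model the workaround is not so crude after all: it simply moves each label to a point after the last timestamp it is supposed to cover.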

numpy.dtype size changed runtime warning on XP

I've installed numpy from numpy-1.5.1-win32-superpack-python2.6.exe
and pandas from pandas-0.2.win32-py2.6.exe.

When I import * from pandas I get the following warnings:

Warning (from warnings module):
File "C:\Program Files\Python26\lib\site-packages\pandas\core\index.py", line 7
from pandas.lib.tseries import map_indices
RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility

Warning (from warnings module):
File "C:\Program Files\Python26\lib\site-packages\pandas\core\index.py", line 7
from pandas.lib.tseries import map_indices
RuntimeWarning: numpy.flatiter size changed, may indicate binary incompatibility

I've had no issues so far using pandas interactively in the IDLE window, but if I embed the import in a module, I am not able to execute any of the code from the module.

Thoughts?

.apply() API consistency fix

DataFrame.apply, Series.applymap, Series.map sort of come across as inconsistent.

def f(x):
    return x.apply(lambda y: 2*y)
def g(x):
    return x.applymap(lambda y: 2*y)

a = TimeSeries([1,2],[1,2])
b = DataMatrix({'a':a})

f(b)
g(b)
f(a)    
g(a)

"Proper" boolean array with NA handling in DataMatrix

Currently booleans are cast to floats in some circumstances in order to handle NAs. Need to devise a workable scheme for boolean data possibly containing NAs.

Outlier detection in pandas.stats.moments functions

Floating point error can result in incorrect output in the rolling_* functions (need unit tests)
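One way this can happen, shown as a plain-Python sketch rather than the actual rolling_* code: a rolling sum that adds the incoming value and subtracts the outgoing one keeps the rounding error introduced while an outlier was in the window. The function names and sample values below are illustrative assumptions.

```python
import math


def rolling_sum_streaming(values, window):
    """Rolling sum that updates a running total (add new value, drop old one)."""
    out = []
    total = 0.0
    for i, x in enumerate(values):
        total += x
        if i >= window:
            total -= values[i - window]
        if i >= window - 1:
            out.append(total)
    return out


def rolling_sum_exact(values, window):
    """Rolling sum recomputed per window with math.fsum (slower, correctly rounded)."""
    return [
        math.fsum(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


# One large outlier followed by small values.
values = [1e16, 1.0, 1.0, 1.0]
streaming = rolling_sum_streaming(values, 2)
exact = rolling_sum_exact(values, 2)
print(streaming)
print(exact)
```

The streaming version reports 0.0 for windows whose true sum is 2.0, because the 1.0 added while 1e16 was in the running total was lost to rounding. A unit test along these lines would catch that class of error.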

Install problems on Windows 7

I'm on a new computer with Windows 7.

I've installed the 32 bit versions of python-2.7.1 and all subsequent requirements.

pandas-0.3.0.win32-py2.7.exe does not complete. It just stops and Windows claims it is not responding. The same thing happened when installing tables-2.2.1.win32-py2.7.exe. I'm assuming that PyTables is recommended but not required for pandas. Either way, neither of these installers works.

Suggestions?

Python 2.7 testing

Test pandas on Python 2.7

NumPy >= 1.4.0 NaN-handling issues

There are a few potential NaN-casting problems floating around the codebase, e.g.:

df1 = pandas.DataFrame({'x':[5]})
df2 = pandas.DataFrame({'x':[1]})
df1.combineAdd(df2)

Better implementation of rolling_max and min

The current implementation uses a skiplist, which is unnecessary
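One common alternative, offered as a sketch and not as the planned pandas implementation, is a monotonic deque: it yields each window maximum in amortized constant time without any ordered container. The version below ignores NaN handling and min_periods, both of which a real implementation would need; rolling_min follows by flipping the comparison.

```python
from collections import deque


def rolling_max(values, window):
    """Return the maximum of each full window using a monotonic deque."""
    out = []
    candidates = deque()  # indices whose values are strictly decreasing
    for i, x in enumerate(values):
        # Values no larger than x can never be a window maximum again.
        while candidates and values[candidates[-1]] <= x:
            candidates.pop()
        candidates.append(i)
        # Drop the front index once it has slid out of the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        if i >= window - 1:
            out.append(values[candidates[0]])
    return out


maxima = rolling_max([1, 3, 2, 5, 4, 1], 3)
print(maxima)
```

Each index is appended and removed at most once, so the total cost is linear in the length of the input regardless of the window size.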

HDFStore testing and improvements

Per many requests... just a placeholder to remind me to do it

Consistent column ordering between DataFrame and DataMatrix

To allow the classes to have more nearly identical behavior

what is the easiest way to plot a timeseries and dataframe?

It looks like ts.plot and df.plot are not implemented yet. I watched Wes's PyCon 2010 video, where he uses a function called fplot (presumably defined in the pycon_demo.py that was run at the beginning of the demo). pandas is a great package. I hope it will be kept active and keep expanding. Great work. Thanks.

dtype mismatch when calling .median()

Using Python 2.5.2, pandas 0.3 and numpy 1.5.1 on a Debian lenny box: mean() works, but median() does not.

>>> df1['b'].truncate(before=dt.date(2010,12,2),after=dt.date(2010,12,4))
2010-12-02    6
2010-12-03    7
2010-12-04    8
>>> df1['b'].truncate(before=dt.date(2010,12,2),after=dt.date(2010,12,4)).median()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "build/bdist.linux-i686/egg/pandas/core/series.py", line 486, in median
 File "moments.pyx", line 70, in tseries.median (pandas/lib/src/tseries.c:5967)
 File "moments.pyx", line 28, in tseries.kth_smallest (pandas/lib/src/tseries.c:5539)
ValueError: Buffer dtype mismatch, expected 'double_t' but got 'long'
>>> df1['b'].truncate(before=dt.date(2010,12,2),after=dt.date(2010,12,4)).mean()
7.0