bashtage / linearmodels

Additional linear models including instrumental variable and panel data models that are missing from statsmodels.

Home Page: https://bashtage.github.io/linearmodels/

License: University of Illinois/NCSA Open Source License

Python 99.57% Shell 0.19% Cython 0.24%

iv instrumental-variable panel regression statistical-model ols gmm fixed-effects random-effects between-estimator

linearmodels's Introduction

Linear Models

Linear (regression) models for Python. Extends statsmodels with Panel regression, instrumental variable estimators, system estimators and models for estimating asset prices:

Panel models:
- Fixed effects (maximum two-way)
- First difference regression
- Between estimator for panel data
- Pooled regression for panel data
- Fama-MacBeth estimation of panel models
High-dimensional Regression:
- Absorbing Least Squares
Instrumental Variable estimators:
- Two-stage Least Squares
- Limited Information Maximum Likelihood
- k-class Estimators
- Generalized Method of Moments, including the continuously updating estimator
Factor Asset Pricing Models:
- 2- and 3-step estimation
- Time-series estimation
- GMM estimation
System Regression:
- Seemingly Unrelated Regression (SUR/SURE)
- Three-Stage Least Squares (3SLS)
- Generalized Method of Moments (GMM) System Estimation

Designed to work equally well with NumPy, Pandas or xarray data.

Panel models

As in statsmodels, models can be specified directly from data or with formulas. For example, the classic Grunfeld regression can be specified from a DataFrame with an entity-time MultiIndex:

import numpy as np
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data.year = data.year.astype(np.int64)
# MultiIndex, entity - time
data = data.set_index(['firm','year'])
from linearmodels import PanelOLS
mod = PanelOLS(data.invest, data[['value','capital']], entity_effects=True)
res = mod.fit(cov_type='clustered', cluster_entity=True)

Models can also be specified using the formula interface.

from linearmodels import PanelOLS
mod = PanelOLS.from_formula('invest ~ value + capital + EntityEffects', data)
res = mod.fit(cov_type='clustered', cluster_entity=True)

The formula interface for PanelOLS supports the special values EntityEffects and TimeEffects which add entity (fixed) and time effects, respectively.
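To illustrate what adding entity effects means, the following is a minimal NumPy sketch, not the library's implementation. It uses simulated data (all names and values are made up for illustration) and shows that OLS on entity-demeaned data gives the same slope as a regression that includes one dummy per entity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_periods = 5, 8
entity = np.repeat(np.arange(n_entities), n_periods)  # entity id per row

# Simulated data: y = 2*x + entity-specific intercept + noise
alpha = rng.normal(size=n_entities)
x = rng.normal(size=entity.size) + alpha[entity]
y = 2.0 * x + alpha[entity] + 0.1 * rng.normal(size=entity.size)


def demean_by(group, values):
    """Subtract the group mean from each observation."""
    means = np.bincount(group, weights=values) / np.bincount(group)
    return values - means[group]


# Within estimator: OLS on entity-demeaned data
x_w, y_w = demean_by(entity, x), demean_by(entity, y)
beta_within = (x_w @ y_w) / (x_w @ x_w)

# Equivalent dummy-variable (LSDV) regression
dummies = (entity[:, None] == np.arange(n_entities)).astype(float)
X = np.column_stack([x, dummies])
beta_lsdv = np.linalg.lstsq(X, y, rcond=None)[0][0]

print(beta_within, beta_lsdv)
```

The two estimates agree up to floating-point error, which is the Frisch-Waugh-Lovell result applied to entity dummies.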

Formula support comes from the formulaic package which is a replacement for patsy.

Instrumental Variable Models

IV regression models can be similarly specified.

import numpy as np
from linearmodels.iv import IV2SLS
from linearmodels.datasets import mroz
data = mroz.load()
mod = IV2SLS.from_formula('np.log(wage) ~ 1 + exper + exper ** 2 + [educ ~ motheduc + fatheduc]', data)

The expressions in the [ ] indicate endogenous regressors (before ~) and the instruments.
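As a rough illustration of what the bracket syntax encodes, here is a hand-rolled two-stage least squares computation in NumPy. It is a sketch on simulated data, not the library's code; the variable names (`z`, `w`, `x_endog`) and coefficients are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=(n, 2))   # instruments, excluded from the y equation
u = rng.normal(size=n)        # unobserved confounder
w = rng.normal(size=n)        # exogenous regressor
x_endog = z @ np.array([1.0, 0.5]) + u + rng.normal(size=n)
y = 1.0 + 0.5 * w + 2.0 * x_endog + u + rng.normal(size=n)

exog = np.column_stack([np.ones(n), w])
Z = np.column_stack([exog, z])        # all instruments: exog plus excluded
X = np.column_stack([exog, x_endog])  # all regressors

# First stage: project the regressors onto the instrument space
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
# Second stage: regress y on the fitted regressors
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]
# Plain OLS for comparison; biased because x_endog is correlated with u
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_2sls[2], beta_ols[2])
```

In this simulation the 2SLS coefficient on the endogenous regressor is close to the true value of 2, while OLS is biased upward.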

Installing

The latest release can be installed using pip

pip install linearmodels

The main branch can be installed by cloning the repo and installing with pip

git clone https://github.com/bashtage/linearmodels
cd linearmodels
pip install .

Documentation

Stable Documentation is built on every tagged version using doctr. Development Documentation is automatically built on every successful build of main.

Plan and status

Should eventually add some useful linear model estimators such as panel regression. Currently only the single variable IV estimators are polished.

Linear Instrumental variable estimation - complete
Linear Panel model estimation - complete
Fama-MacBeth regression - complete
Linear Factor Asset Pricing - complete
System regression - complete
Linear IV Panel model estimation - not started
Dynamic Panel model estimation - not started

Requirements

Running

Python 3.9+
NumPy (1.22+)
SciPy (1.8+)
pandas (1.4+)
statsmodels (0.12+)
formulaic (1.0.0+)
xarray (0.16+, optional)
Cython (3.0.10+, optional)

Testing

py.test

Documentation

sphinx
sphinx-immaterial
nbsphinx
nbconvert
nbformat
ipython
jupyter

linearmodels's Issues

Link on top of your github page has extra s

Hello,

This looks interesting.

The link on the top of https://github.com/bashtage/linearmodels points to

https://bashtage.github.io/linearmodels/docs

I think there is an extra "s" in "docs"

LinearFactorModels: Issue when number of portfolios larger than number of observations

I have been running into a problem when using the LinearFactorModel functionality, that only appears to occur when the number of portfolios is larger than the number of observations per portfolio. The error message pertains to a constant being present (which should not be the case here). My guess is that the has_constant util delivers a faulty output in this use case. The error received is the following:

File ".../anaconda3/lib/python3.6/site-packages/linearmodels/asset_pricing/model.py", line 331, in __init__
    super(LinearFactorModel, self).__init__(portfolios, factors)
File ".../anaconda3/lib/python3.6/site-packages/linearmodels/asset_pricing/model.py", line 67, in __init__
    self._validate_data()
File ".../anaconda3/lib/python3.6/site-packages/linearmodels/asset_pricing/model.py", line 102, in _validate_data
    raise ValueError('portfolios must not contains a constant or equivalent.')
ValueError: portfolios must not contains a constant or equivalent.

To aid the debugging, I have constructed a minimal example. Please see below:

import pandas as pd
import numpy as np
from linearmodels.asset_pricing import LinearFactorModel
PF = pd.read_pickle('testPFs.pkl')
factors = pd.read_pickle('testfactors.pkl')

model = LinearFactorModel(PF,factors)
modelres = model.fit()

The necessary data is made available at the following location: https://www.dropbox.com/sh/h6iwxw8t6h0njig/AABcguLzp7XrmSXJ8wuYdygja?dl=0

Any help is much appreciated!

Dummy variable regression in PooledOLS generates an error

I conducted a dummy variable regression based on this page, using linearmodels (PooledOLS) and statsmodels with the same formula (see below). The latter finishes computation and generates a result, whereas linearmodels produces an error message:

ValueError: exog does not have full column rank.

I wonder what is happening in linearmodels. Does a bug cause this? Can you shed light on this?

import pandas as pd
from linearmodels.datasets import wage_panel
from linearmodels.panel import PooledOLS
from statsmodels.formula.api import ols

data = wage_panel.load()
year = pd.Categorical(data.year)
nr = pd.Categorical(data.nr)
data = data.set_index(['nr', 'year'])
data['year'] = year
data['nr'] = nr

# formula
formula = 'lwage ~ married + union + educ + year + year:educ + nr'

# statsmodels -----------
result_sm = ols(formula, data=data).fit()
# show selected parameters
condition = result_sm.params.index.str.contains('nr').tolist()
result_sm.params[[not i for i in condition]]

# linearmodels ----------
result_lm = PooledOLS.from_formula(formula, data=data).fit()

# generate the error message 'ValueError: exog does not have full column rank.'

Remove pandas Panel

Pandas panel has been deprecated and so should be removed from internal use, but not as a data format.

Wrong time period counts

Maybe this is just a matter of interpretation, but the years in my dataset span 2005 to 2017, or 13 periods, yet the output of PanelOLS claims 28 time periods. Any idea why this might be happening? I can't share the dataset. I'm not sure what a minimal sample to replicate this would look like, but I can look into it when I have more time, so this is just a placeholder for now.


                         PanelOLS Estimation Summary
==============================================================================
Dep. Variable:                    log(Y)   R-squared:                   0.5801
Estimator:                      PanelOLS   R-squared (Between):         0.9087
No. Observations:                   3896   R-squared (Within):          0.5801
Date:                   Fri, Mar 09 2018   R-squared (Overall):         0.8529
Time:                           17:01:21   Log-likelihood               4041.6
Cov. Estimator:               Unadjusted
                                           F-statistic:                 1218.1
Entities:                            418   P-value                      0.0000
Avg Obs:                          9.3206   Distribution:             F(4,3526)
Min Obs:                          0.0000
Max Obs:                          13.000   F-statistic (robust):        1218.1
                                           P-value                      0.0000
Time periods:                         28   Distribution:             F(4,3526)
Avg Obs:                          139.14
Min Obs:                          0.0000
Max Obs:                          356.00

                              Parameter Estimates
===============================================================================
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
Intercept       0.0764     0.0106     7.2103     0.0000      0.0556      0.0971
lag(log(Y))     0.7169     0.0111     64.811     0.0000      0.6952      0.7386
I((X1))        -0.0118     0.0043    -2.7613     0.0058     -0.0202     -0.0034
X2             -0.0015     0.0005    -2.7455     0.0061     -0.0025     -0.0004
X3              0.0004  5.189e-05     7.7485     0.0000      0.0003      0.0005
===============================================================================

Syntax error in linearmodels/asset_pricing/model.py

Hi,

I am getting the following error while trying to use linearmodels.

File "anaconda/lib/python2.7/site-packages/linearmodels/asset_pricing/model.py", line 119
def from_formula(cls, formula, data, *, portfolios=None):
^
SyntaxError: invalid syntax

Can't open the data sets. Data sets not present

Loading any of the data sets provided by the linearmodels package is not possible. None of the csv files are there.

BUG: Invalid Estimations in Fama Macbeth Regressions

I have noticed that I am able to get coefficient estimates for variables which vary by time but not by entity, while including a constant. As the first stage regressions are partitioned by time, the variable is a constant, and yet I get a parameter estimate.

It looks like this is happening because numpy.linalg.lstsq does not require full column rank for the exogenous matrix. I notice that FamaMacbeth inherits from PooledOLS which has a _validate_data method which checks that the exogenous variables have full column rank. However this check is only completed for the input data, which is not partitioned by time.

Each time group is passed to FamaMacbeth.fit.single to run the first stage regression. Within single, there is a check to see whether there are enough observations to run the regression: if exog.shape[0] < exog.shape[1]. I believe a check to see whether the exogenous matrix for the selected time period has full rank should be added here: if (exog.shape[0] < exog.shape[1]) or (matrix_rank(exog) < exog.shape[1]):. In my testing, this fixes the issue.
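The proposed check can be sketched in isolation as follows. This is a standalone illustration with a hypothetical helper `can_estimate` and made-up data, not the code inside `FamaMacbeth.fit`. It shows that a variable which is constant within a period duplicates the intercept, and that `numpy.linalg.lstsq` still returns parameters in that case.

```python
import numpy as np
from numpy.linalg import matrix_rank


def can_estimate(exog):
    """True if a cross-section has enough rows and full column rank."""
    nobs, nvar = exog.shape
    return bool(nobs >= nvar and matrix_rank(exog) == nvar)


rng = np.random.default_rng(2)
n_entities = 10
const = np.ones(n_entities)
x = rng.normal(size=n_entities)
# A variable that varies over time but not across entities is constant
# within any single period, so it is collinear with the intercept.
macro = np.full(n_entities, 3.7)

ok = np.column_stack([const, x])
bad = np.column_stack([const, x, macro])
print(can_estimate(ok), can_estimate(bad))

# lstsq does not raise for the rank-deficient design; it reports the rank
y = rng.normal(size=n_entities)
params, _, rank, _ = np.linalg.lstsq(bad, y, rcond=None)
print(params.shape, rank)
```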

I'm not sure what your process is. I'm happy to create a PR or add an example if necessary.

SUR Tasks

How to perform wald test for selected joint coefficients?

Is there an easy way to calculate not the F-test, but instead just the hypothesis that a few coefficients jointly equal 0?
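For reference, a Wald statistic for a joint restriction can be computed by hand from any parameter vector and covariance estimate. The sketch below uses made-up numbers and a hypothetical helper `wald_statistic`; it tests H0: R b = q using W = (R b - q)' (R V R')^{-1} (R b - q), which is asymptotically chi-square with as many degrees of freedom as there are restrictions.

```python
import numpy as np


def wald_statistic(params, cov, R, q=None):
    """Wald statistic for H0: R @ params = q."""
    R = np.atleast_2d(np.asarray(R, dtype=float))
    q = np.zeros(R.shape[0]) if q is None else np.asarray(q, dtype=float)
    diff = R @ params - q
    return float(diff @ np.linalg.solve(R @ cov @ R.T, diff))


# Hypothetical estimates for three coefficients and their covariance
params = np.array([0.8, 0.05, -0.02])
cov = np.diag([0.04, 0.01, 0.01])

# H0: the second and third coefficients are jointly zero
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
stat = wald_statistic(params, cov, R)

# Special case: with exactly 2 restrictions the chi-square(2)
# survival function has the closed form exp(-stat / 2)
pval = np.exp(-stat / 2)
print(stat, pval)
```

With fitted results, `params` and `cov` would be taken from the result object; for a different number of restrictions the p-value comes from the chi-square distribution with that many degrees of freedom.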

Licensing clarification

Hi - can you please clarify how this package is licensed? I see the reference to "NCSA" at line 83 of the setup.py file.

Repository licensing: https://help.github.com/articles/licensing-a-repository/

Thanks!

Seemingly Unrelated Regression weights and out of sample prediction

Hello,

I am interested in performing seemingly unrelated regression using linearmodels. I would like to be able to weight the observations in the model and obtain out-of-sample predictions. Is this possible? I have adapted the example code to begin exploring the possibilities but have not made progress, and I don't know whether this is because what I want to do isn't possible.

import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(500, 4), columns=['y1', 'x1_1', 'y2', 'x2_1'])
weight = pd.DataFrame(np.random.randn(500, 1), columns=['weight'])
from linearmodels.system import SUR
formula = {'eq1': 'y1 ~ 1 + x1_1', 'eq2': 'y2 ~ 1 + x2_1'}
mod = SUR.from_formula(formula, data, weights=weight)
res = mod.fit(cov_type='unadjusted')
res
#data1 = pd.DataFrame(np.random.randn(500, 4), columns=['y1', 'x1_1', 'y2', 'x2_1'])
#pre = res.predict(data1)

with error message:

C:\Program Files\Anaconda3\lib\site-packages\linearmodels\system\model.py:90: UserWarning: Weights not found for equation labels:
eq1, eq2
  warnings.warn(msg, UserWarning)

Thanks.

ENH: Add predict()

https://stackoverflow.com/questions/47645280/how-to-do-predict-for-linearmodels

Add predict method or easy method to get fitted values

Add method to get fitted values with or without fixed effects, if possible

Covariance Estimates FamaMacBeth vs LinearFactorModel

Hi,

when running FamaMacBeth with inputs from a set of time series regressions analogous to the first step in the LinearFactorModel, there appears to be an issue with the covariance estimate for the second stage parameters. Comparing the two approaches I can obtain identical parameter estimates (risk premia) but there is a discrepancy in the standard errors (obviously, assuming the same cov_type/ kernel parameter is used).

For example please see below the output from first running a time series regression and then feeding the betas to the FamaMacBeth function

factor	parameter	std_error	tstat	pvalue
X_1	0.425711	0.340993	1.248444	2.119266e-01
X_2	-0.645416	0.294828	-2.189129	2.863339e-02
X_3	0.519397	0.412518	1.259088	2.080571e-01
X_4	0.476740	0.716597	0.665284	5.058993e-01
X_5	5.427486	0.821842	6.604049	4.414846e-11
X_6	-1.205829	0.346899	-3.476028	5.132095e-04

And now below the output from LinearFactorModel using the same inputs

factor	parameter	std_error	tstat	pvalue
X_1	0.425711	0.402419	1.057881	0.290161
X_2	-0.645416	0.451214	-1.430398	0.152665
X_3	0.519397	1.492575	0.347987	0.727865
X_4	0.476740	2.892115	0.164841	0.869075
X_5	5.427486	3.202529	1.694750	0.090185
X_6	-1.205829	0.665081	-1.813055	0.069883

In both cases I use cov_type='kernel' and kernel='bartlett'. As can be seen the parameter estimates agree, the errors do not. I have encountered this problem for a wide range of factors / test assets.

Any help would be much appreciated.

RLS: Release 3.0

Asset Pricing

Technical docs
Test formulas with categoricals

Panel

Test covariance estimators
Add AC estimator
Automatic bandwidth for HAC
Fix examples to use print

General

Wire up read the docs - NOT CURRENTLY POSSIBLE DUE TO TIMEOUT
Add RTD to the README
Update testing dependencies

Reproduction of the output of R code failed

I tried to reproduce the output of the following R code using linearmodels:
(Output of Script 14.1: Example-14-2.R on page 208, taken from this page or this book).

library(foreign);library(plm)
wagepan<-read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/wagepan.dta")

# Generate pdata.frame:
wagepan.p <- pdata.frame(wagepan, index=c("nr","year") )

pdim(wagepan.p)

# Estimate FE model
summary( plm(lwage~married+union+factor(year)*educ, 
                                        data=wagepan.p, model="within") )

My code is

import pandas as pd
from linearmodels import PanelOLS

wagepan = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/wagepan.dta')
wagepan = wagepan.set_index(['nr', 'year'], drop=False)

formula = 'lwage ~  married + union + C(year)*educ -educ + EntityEffects'

result = PanelOLS.from_formula(formula, data=wagepan).fit()

However, the code generates the following error:

---------------------------------------------------------------------------
AbsorbingEffectError                      Traceback (most recent call last)
<ipython-input-6-bad630cad00e> in <module>
----> 1 result = PanelOLS.from_formula(formula, data=wagepan).fit()

~/anaconda3/lib/python3.7/site-packages/linearmodels/panel/model.py in fit(self, use_lsdv, use_lsmr, low_memory, cov_type, debiased, auto_df, count_effects, **cov_config)
   1310                 absorbed_variables = '\n'.join(rows)
   1311                 msg = absorbing_error_msg.format(absorbed_variables=absorbed_variables)
-> 1312                 raise AbsorbingEffectError(msg)
   1313 
   1314         params = lstsq(x, y)[0]

AbsorbingEffectError: 
The model cannot be estimated. The included effects have fully absorbed
one or more of the variables. This occurs when one or more of the dependent
variable is perfectly explained using the effects included in the model.

The following variables or variable combinations have been fully absorbed
or have become perfectly collinear after effects are removed:

          C(year)[1980.0], C(year)[1981.0], C(year)[1982.0], C(year)[1983.0], C(year)[1984.0], C(year)[1985.0], C(year)[1986.0], C(year)[1987.0], 
          C(year)[1980.0]:educ, C(year)[1981.0]:educ, C(year)[1982.0]:educ, C(year)[1983.0]:educ, C(year)[1984.0]:educ, C(year)[1985.0]:educ, C(year)[1986.0]:educ, C(year)[1987.0]:educ

It seems that the error is caused by the presence of C(year)[1980.0] and C(year)[1980.0]:educ. In the R code above, those two variables are automatically dropped.

Is this a linearmodels problem?
Is it possible to replicate the behaviour of R, using from_formula?
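The collinearity described in the error message can be reproduced on a toy panel. The sketch below is an illustration with made-up `educ` values and a small balanced panel, not the wagepan data. It shows that a full set of year dummies, and a full set of year-by-educ interactions with a time-invariant educ, each lose one dimension after entity means are removed, while dropping the first year from both sets restores full column rank.

```python
import numpy as np

n_entities, n_periods = 6, 4
entity = np.repeat(np.arange(n_entities), n_periods)
period = np.tile(np.arange(n_periods), n_entities)

# Made-up, time-invariant education level for each entity
educ_by_entity = np.array([9.0, 10.0, 12.0, 12.0, 14.0, 16.0])
educ = educ_by_entity[entity]


def demean_by(group, X):
    """Subtract group means column by column."""
    counts = np.bincount(group)
    out = X.astype(float).copy()
    for j in range(X.shape[1]):
        out[:, j] -= (np.bincount(group, weights=X[:, j]) / counts)[group]
    return out


year_dummies = (period[:, None] == np.arange(n_periods)).astype(float)
interactions = year_dummies * educ[:, None]

# All years included versus first year dropped from both sets
full = np.column_stack([year_dummies, interactions])
reduced = np.column_stack([year_dummies[:, 1:], interactions[:, 1:]])

rank_full = np.linalg.matrix_rank(demean_by(entity, full))
rank_reduced = np.linalg.matrix_rank(demean_by(entity, reduced))
print(full.shape[1], rank_full, reduced.shape[1], rank_reduced)
```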

Thanks for your help in advance.

linearmodels version: 4.12
patsy version : 0.5.1

Title of summary table for `IV-2SLS` and the Name of `Estimator` inside the output table

In the manual Basic Examples of Instrumental Variable Estimation, the following code

res_ols = IV2SLS(np.log(data.wage), data[['const','educ']], None, None).fit(cov_type='unadjusted')
print(res_ols)

gives the output with the title OLS Estimation Summary and Estimator: OLS inside the table. This is a great feature.

However, if I run the following with a formula

formula = 'np.log(data.wage) ~ 1 + educ'
res_formula = IV2SLS.from_formula(formula, data).fit(cov_type='unadjusted')
print(res_formula)

the result has the title IV-2SLS Estimation Summary and Estimator: IV-2SLS. (Obviously, both codes give exactly the same result.) I wonder if there is a way to change the title to OLS Estimation Summary and the Estimator part inside the table to Estimator: OLS.

`.wu_hausman` in `IV2SLS` gives a wrong output

.wu_hausman in IV2SLS generates a table of IV2SLS results rather than the test statistic, p-value, etc. of the test. Can you look into this?

Problem handling certain variable names?

Hi. I'm not sure what's causing this issue, but I thought I might bring it up here in case it's useful. Thanks for the great package!!

When I run the following, I get an error:

import numpy as np
np.set_printoptions(suppress=True, precision=2)
import pandas as pd
import linearmodels

def generate_data(N=50):
    mean = np.array([0,0,0])
    cov = np.array([[1, 0, .5],
                    [0, 1, 0],
                    [.5, 0, 1]])
    vec = np.random.multivariate_normal(mean, cov, N)
    df = pd.DataFrame(vec, columns=['u', 'z', 'epsilon'])
    beta0 = beta1 = gamma0 = gamma1 = 1
    beta2 = 2
    df['x1'] = gamma0 + gamma1 * df.z + df.u
    df['y'] = beta0 + beta1 * df.x1 + df.epsilon
    return df

np.random.seed(100)
df = generate_data(N=100)
reg = linearmodels.IV2SLS.from_formula('y ~ 1 + [x1 ~ z]', data=df).fit()

ERROR:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-e7558e367513> in <module>()
     19 np.random.seed(100)
     20 df = generate_data(N=100)
---> 21 reg = linearmodels.IV2SLS.from_formula('y ~ 1 + [x1 ~ z]', data=df).fit()

~\Anaconda3\lib\site-packages\linearmodels\iv\model.py in from_formula(formula, data, weights)
    618         >>> mod = IV2SLS.from_formula(formula, data)
    619         """
--> 620         parser = IVFormulaParser(formula, data)
    621         dep, exog, endog, instr = parser.data
    622         mod = IV2SLS(dep, exog, endog, instr, weights=weights)

~\Anaconda3\lib\site-packages\linearmodels\iv\_utility.py in __init__(self, formula, data, eval_env)
     78         self._eval_env = eval_env
     79         self._components = {}
---> 80         self._parse()
     81 
     82     def _parse(self):

~\Anaconda3\lib\site-packages\linearmodels\iv\_utility.py in _parse(self)
     99                 raise ValueError('endogenous block must not start or end with +. This block '
    100                                  'was: {0}'.format(endog))
--> 101             if instr[0] == '+' or instr[1] == '+':
    102                 raise ValueError('instrument block must not start or end with +. This '
    103                                  'block was: {0}'.format(instr))

IndexError: string index out of range

If I change the variable name, then the error goes away:

df['z1'] = df.z
reg = linearmodels.IV2SLS.from_formula('y ~ 1 + [x1 ~ z1]', data=df).fit()
reg

I'm using Python 3.6 on Anaconda. Here is the version info of linearmodels and pandas:

λ pip show linearmodels
Version: 4.7

λ pip show pandas
Version: 0.22.0

Using scipy's genetic algorithm for initial parameter estimation

I have used scipy's Differential Evolution genetic algorithm to find initial parameters for fitting a double Lorentzian peak equation to Raman spectroscopy data. I found that the results were excellent, and see from your GitHub project that you might find my project to be interesting.

The GitHub project, with a test spectroscopy data file, is:

https://github.com/zunzun/RamanSpectroscopyFit

If you have any questions, please let me know.

James Phillips

Memory Efficient Estimation

I'd greatly appreciate an option within model.fit() that turns off storing the residuals, forecasted values, etc. This would come in handy for data-heavy applications in which one is only interested in the fitted coefficients and statistics, but not in the residuals.
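Independently of any library option, coefficients alone can be obtained without ever holding residuals or fitted values in memory. The following is a sketch under the assumption that the data can be read in blocks; the helper `ols_in_chunks` is hypothetical and only accumulates the k-by-k cross-products.

```python
import numpy as np


def ols_in_chunks(chunks):
    """OLS coefficients from an iterable of (X, y) blocks.

    Only X'X and X'y are kept, so no residuals or fitted values are stored.
    """
    xtx = xty = None
    for X, y in chunks:
        if xtx is None:
            k = X.shape[1]
            xtx, xty = np.zeros((k, k)), np.zeros(k)
        xtx += X.T @ X
        xty += X.T @ y
    return np.linalg.solve(xtx, xty)


rng = np.random.default_rng(4)
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Feed the data in blocks of 1000 rows
blocks = ((X[i:i + 1000], y[i:i + 1000]) for i in range(0, n, 1000))
beta_chunked = ols_in_chunks(blocks)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_chunked, beta_full)
```

Solving the normal equations is less numerically stable than a QR or SVD based solver when regressors are nearly collinear, so this trades some robustness for memory.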

PanelOLS.from_formula performance issues

Getting some unexpected performance issues when I use PanelOLS.from_formula on a dataframe that has lots of variables. Problems are exacerbated if there are non-numeric types in the dataframe (even if unused). Notice if I do the patsy calls manually, performance is much better (including the timing of the patsy call).

import pandas as pd, numpy as np, time, patsy
from linearmodels import PanelOLS
cols = [s+(str(x) if x>0 else '') for s in list('ABCD') for x in range(100)]
n = 1000
t= 50
df = pd.DataFrame(np.random.uniform(1.0,100.0,size=(n*t, len(cols))), columns=cols)
df.index = pd.MultiIndex.from_arrays(np.mgrid[0:n,0:t].reshape(2,n*t),names=['ID','YEAR'])
df['Y'] = df['A']+df['B']+df['C']+df['D']
df['A1'] = 'str'
df['D99'] = 'str'

# Run 1 - from_formula on full dataframe
t=time.time()
f1 = PanelOLS.from_formula('Y~A+C+D+EntityEffects', df).fit()
print('from_formula with strings in dataframe',time.time()-t)

#Run 2 - Call patsy manually on full dataframe
t=time.time()
e,x = patsy.dmatrices('Y~A+C+D',df,return_type='dataframe')
f2=PanelOLS(e, x, entity_effects=True).fit()
print('PanelOLS with strings in dataframe',time.time()-t)

#Run 3 - Drop the unused strings from the dataframe. Helps a bit but still slower than run #2.
del df['A1']
del df['D99']
t=time.time()
f3 = PanelOLS.from_formula('Y~A+C+D+EntityEffects', df).fit()
print('from_formula without strings in dataframe',time.time()-t)

print(f1)
print(f2)
print(f3)

Here's the first three lines of output showing the timings:

from_formula with strings in dataframe 1.1370854377746582
PanelOLS with strings in dataframe 0.09599709510803223
from_formula without strings in dataframe 0.5590300559997559

Is it possible to change the T-stats reported in parentheses to standard errors?

When using the compare command, is it possible to show standard errors or p-values rather than T-stats in parentheses?
I also hope that LaTeX output can be made available in the future.
Thanks.

scipy's factorial function no longer in misc

The factorial function is no longer in scipy.misc as of scipy 1.3.0. It has lived in scipy.special since scipy 1.0.0, but the alias in scipy.misc was removed in 1.3.0. This causes the tests to fail.
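A version-tolerant import might look like the sketch below. The fallback to the standard library is an assumption added for illustration and only handles scalar integers, unlike the SciPy function.

```python
try:
    # Location of factorial since SciPy 1.0.0
    from scipy.special import factorial
except ImportError:
    # Fallback when SciPy is unavailable: scalar-only, standard library
    from math import factorial

print(factorial(5))
```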

R-squared in PanelOLS

Using PanelOLS, I get different R-squared values than those produced by statistical software like Stata or by ols in the statsmodels package. The reason might be that PanelOLS seems to return the R-squared after removing fixed effects even if the model includes FEs (correct me if I'm wrong). In economics papers, this is perhaps less common. Is there any way to produce the R-squared without removing FEs?
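The two notions of R-squared being contrasted can be computed by hand. The sketch below uses simulated data with large entity effects (all values are made up) and is not the library's code: the within R-squared measures fit after entity means are removed, while the second measure counts the fixed effects as part of the explained variation, as a dummy-variable regression would.

```python
import numpy as np

rng = np.random.default_rng(5)
n_entities, n_periods = 50, 10
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = 3.0 * rng.normal(size=n_entities)  # large fixed effects
x = rng.normal(size=entity.size)
y = 1.0 * x + alpha[entity] + rng.normal(size=entity.size)


def demean_by(group, v):
    """Subtract the group mean from each observation."""
    return v - (np.bincount(group, weights=v) / np.bincount(group))[group]


x_w, y_w = demean_by(entity, x), demean_by(entity, y)
beta = (x_w @ y_w) / (x_w @ x_w)
resid = y_w - beta * x_w  # same residuals as the dummy-variable fit

# Within R-squared: share of demeaned y explained by demeaned x
r2_within = 1.0 - (resid @ resid) / (y_w @ y_w)
# R-squared with the entity effects counted as regressors
y_c = y - y.mean()
r2_with_effects = 1.0 - (resid @ resid) / (y_c @ y_c)
print(r2_within, r2_with_effects)
```

Because the total sum of squares of y is never smaller than that of the demeaned y, the measure that includes the effects is always at least as large as the within R-squared.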

Open Tasks

Asset Pricing

Asset pricing DRY

System

HAC Covariance

Panel

Test panel covariance estimators
Add automatic bandwidth to all HAC estimators

General

Convert smoke tests to actual tests

Standard errors of the estimators when IV2SLS is used as OLS

I compared the results of Wages of Married Women at this page, using linearmodels, statsmodels and R. Naturally, all of them give the same estimates of const and educ. However, the standard errors of the estimators in linearmodels are (slightly) different from those in statsmodels and R, which give essentially identical values (after rounding). I also "manually" calculated them from the residuals of linearmodels and of statsmodels. The two manual results are identical, and basically the same as those reported by statsmodels. This suggests that there is something "funny" in the standard errors reported by linearmodels. I wonder if you can help me understand these results.

Standard Errors of const
linearmodels: 0.184793
statsmodels: 0.185226
R: 0.18522590

Standard Errors of educ
linearmodels: 0.014366
statsmodels: 0.014400
R: 0.01439985

Python Code

import numpy as np
from linearmodels.iv import IV2SLS
from linearmodels.datasets import mroz
from statsmodels.formula.api import ols
from statsmodels.api import add_constant

data = mroz.load().dropna()
data = add_constant(data, has_constant='add')

# linearmodels -------
formula_lm = 'np.log(wage) ~ 1+ educ'
result_lm = IV2SLS.from_formula(formula_lm, data).fit(cov_type='unadjusted')
print(result_lm.std_errors)

# Statsmodels -------
formula_sm = 'np.log(wage) ~ educ'
result_sm = ols(formula_sm, data).fit()
print(result_sm.bse)

Manual Calculations

# linearmodels -------------------------------------
# no of observations
n_lm = result_lm.nobs

# residuals
resid_lm = result_lm.resids

# Standard Error of the Regression
SER_lm = np.std(resid_lm, ddof=1) * np.sqrt((n_lm - 1) / (n_lm - 2))

# Standard Errors of Intercept
SE_const_lm = SER_lm / np.std(data.educ, ddof=1) / \
    np.sqrt(n_lm - 1) * np.sqrt(np.mean(data.educ**2))
print(SE_const_lm)

# Standard Errors of educ
SE_educ_lm = SER_lm / np.std(data.educ, ddof=1) / np.sqrt(n_lm - 1)
print(SE_educ_lm)

# statsmodels -------------------------------------
# no of observations
n_sm = result_sm.nobs

# residuals
resid_sm = result_sm.resid

# Standard Error of the Regression
SER_sm = np.std(resid_sm, ddof=1) * np.sqrt((n_sm - 1) / (n_sm - 2))

# Standard Errors of Intercept
SE_const_sm = SER_sm / np.std(data.educ, ddof=1) / \
    np.sqrt(n_sm - 1) * np.sqrt(np.mean(data.educ**2))
print(SE_const_sm)

# Standard Errors of educ
SE_educ_sm = SER_sm / np.std(data.educ, ddof=1) / np.sqrt(n_sm - 1)
print(SE_educ_sm)
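One possible explanation, offered as a hypothesis rather than a confirmed diagnosis, is a degrees-of-freedom convention: dividing the residual sum of squares by n instead of n - k scales every standard error by sqrt((n - k) / n). The check below applies that factor to the statsmodels values reported above, assuming n = 428 complete observations and k = 2 parameters.

```python
import numpy as np

n, k = 428, 2  # assumed: 428 complete observations, constant + educ

# Standard errors reported above (const, educ)
se_statsmodels = np.array([0.185226, 0.014400])
se_linearmodels = np.array([0.184793, 0.014366])

# Residual variance computed with divisor n rather than n - k
scale = np.sqrt((n - k) / n)
rescaled = se_statsmodels * scale
print(rescaled, se_linearmodels)
```

If this is the cause, passing debiased=True to fit should apply the small-sample correction and bring the linearmodels values in line with the others.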

Singleton Observations in Fixed Effect Models

Does linearmodels account for singleton observations in fixed effect models?

I am familiar with the reghdfe Stata command, but I'm trying to make the full switch to Python. The author of that package also has a brief paper discussing the issue.

As I compare the results of the two implementations, they start off identical and gradually diverge with more complicated models. I'm not entirely sure singleton observations are to blame for the differences, but it seems plausible.
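For comparison purposes, singleton groups can be removed before estimation. The following sketch uses a hypothetical helper `drop_singletons` and toy ids; it repeatedly drops observations that are the only member of any group, since removing one singleton can create another in a different dimension.

```python
import numpy as np


def drop_singletons(*group_ids):
    """Boolean mask of rows that survive iterative singleton removal."""
    keep = np.ones(len(group_ids[0]), dtype=bool)
    changed = True
    while changed:
        changed = False
        for ids in group_ids:
            ids = np.asarray(ids)
            # Count observations per group among the rows still kept
            values, counts = np.unique(ids[keep], return_counts=True)
            singles = values[counts == 1]
            if singles.size:
                keep &= ~np.isin(ids, singles)
                changed = True
    return keep


firm = np.array([1, 1, 2, 3, 3, 4])
year = np.array([2000, 2001, 2000, 2000, 2001, 2002])
mask = drop_singletons(firm, year)
print(mask)
```

Here firms 2 and 4 appear only once, so their rows are dropped; the remaining rows have at least two observations per firm and per year.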

PanelOLS with two-way fixed effects / double clustered std. errors for a panel of millions of observations

Hi, I have come across a performance issue in regards to two-way fixed effects and double clustered standard errors. So essentially I would like to use both entity and time effects as well as double clustered errors in PanelOLS. My panel is unbalanced with about 10 million observations in total, ~5000 entities, ~2000 time steps and say about half a dozen independent variables.

When running this setup in PanelOLS I keep having trouble with the Python process memory blowing up, ultimately resulting in the Python process being killed by my machine. I suppose this should be replicable using other data sets. For now, please see below a minimal example with a generic set of data available here (sorry for the large file, but this appears to be a large data issue):
https://www.dropbox.com/s/u9mzwdlyqp1ylu3/LargePanel.csv?dl=0

import pandas as pd

from linearmodels import PanelOLS

data = pd.read_csv('LargePanel.csv')
data['Time'] = pd.to_datetime(data['Time'])
data.set_index(['EntityNo', 'Time'], inplace=True)


mod = PanelOLS(data['Y1'], data[['X1', 'X2', 'X3', 'X4']],
               entity_effects=True, time_effects=True)
res1 = mod.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)

I would be interested to learn whether there is a way to make the procedure more memory / computationally efficient.
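One memory-light way to remove two-way effects in an unbalanced panel is alternating projections: repeatedly subtract entity means and time means until the values stop changing. The sketch below is an illustration on simulated ids, not the algorithm PanelOLS uses; it relies on `numpy.bincount` so that no dummy-variable matrix is ever formed.

```python
import numpy as np


def group_demean(values, codes, n_groups):
    """Subtract group means using bincount (no dummy matrix is formed)."""
    sums = np.bincount(codes, weights=values, minlength=n_groups)
    counts = np.maximum(np.bincount(codes, minlength=n_groups), 1)
    return values - (sums / counts)[codes]


def two_way_demean(values, entity, time, tol=1e-10, max_iter=1000):
    """Remove entity and time means by alternating projections."""
    n_e, n_t = entity.max() + 1, time.max() + 1
    out = values.astype(float).copy()
    for _ in range(max_iter):
        previous = out
        out = group_demean(out, entity, n_e)
        out = group_demean(out, time, n_t)
        if np.max(np.abs(out - previous)) < tol:
            break
    return out


rng = np.random.default_rng(6)
n = 2000
entity = rng.integers(0, 50, size=n)  # unbalanced: random entity/time pairs
time = rng.integers(0, 20, size=n)
x = rng.normal(size=n)
x_dm = two_way_demean(x, entity, time)

# Both sets of group means should now be numerically zero
entity_means = np.bincount(entity, weights=x_dm, minlength=50)
time_means = np.bincount(time, weights=x_dm, minlength=20)
print(np.abs(entity_means).max(), np.abs(time_means).max())
```

Applying the same transformation to the dependent variable and each regressor, then running OLS on the results, gives the two-way fixed effects slopes with memory use proportional to the data size.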

Panel regression with lagged dependent variables

It looks like many of the needed pieces (esp. GMM IV) are in place to do Arellano-Bond style regressions with a lagged dependent variable, but it doesn't look like it is actually implemented. If that's the case do you have thoughts on how to implement it?

IVLIML: kappa with no exog

If IVLIML is fit with no exogenous data, the eigenvalue 'kappa' is always set to 1 (i.e., the estimator reduces to 2SLS). This looks to be done on purpose right now, but LIML can still be estimated with no exogenous data. Of course, typically a constant is included, but in cases where one explicitly does not want a constant it would be nice to still be able to estimate LIML.

Is there a reason the LIML kappa estimation is set up this way? If not, seems like you could just get rid of the check for no exogenous regressors in the _estimate_kappa function and keep everything else as is.
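To illustrate the point that kappa is still well defined without exogenous regressors, here is a minimal NumPy sketch of the textbook definition: kappa is the smallest eigenvalue of (W'M_x W)(W'M_z W)^{-1}, where W stacks the dependent and endogenous variables and M_x reduces to the identity when there is no exog. The function `liml_kappa` is hypothetical and is not the library's `_estimate_kappa`.

```python
import numpy as np


def liml_kappa(dependent, endog, instruments, exog=None):
    """Smallest eigenvalue defining LIML's kappa (hypothetical helper)."""
    w = np.column_stack([dependent, endog])

    def annihilate(a, b):
        # Residuals of each column of a after projecting on the columns of b
        return a - b @ np.linalg.lstsq(b, a, rcond=None)[0]

    if exog is None or exog.shape[1] == 0:
        e_x = w  # no exogenous regressors: M_x is the identity
        z = instruments
    else:
        e_x = annihilate(w, exog)
        z = np.column_stack([exog, instruments])
    e_z = annihilate(w, z)
    a = e_x.T @ e_x
    b = e_z.T @ e_z
    eigvals = np.linalg.eigvals(np.linalg.solve(b, a))
    return float(np.min(eigvals.real))


# Simulated data with three instruments and no exogenous regressors
rng = np.random.default_rng(0)
n = 500
instruments = rng.standard_normal((n, 3))
u = rng.standard_normal(n)
endog = instruments @ np.array([1.0, 0.5, -0.5]) + 0.5 * u + rng.standard_normal(n)
dependent = 2.0 * endog + u

kappa_over = liml_kappa(dependent, endog[:, None], instruments)
kappa_exact = liml_kappa(dependent, endog[:, None], instruments[:, :1])
```

In the exactly identified case kappa equals 1 and LIML coincides with 2SLS; with more instruments than endogenous variables it is generally above 1, with or without exog.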

Linearmodels 4.5 PanelOLS and PooledOLS prediction issue

Hello everyone,
Could anyone please help me with an issue regarding prediction with the PanelOLS model I am using (in Python) on this sample data? I suppose the error comes from the source code. I've included the code I used and the error I keep getting below.
Thanks in advance for any help.
Milad.

----beginning of the code:
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data.year = data.year.astype(np.int64)

# MultiIndex, entity - time

data = data.set_index(['firm','year'])
from linearmodels import PanelOLS
mod = PanelOLS.from_formula('invest ~ 1 + value + capital + EntityEffects', data)
res = mod.fit(cov_type='clustered', cluster_entity=True)
print(res)
predicted = mod.predict(['value','capital'],data = data[:50] , exog = None)
print(predicted)
#plt.scatter(data.invest,predicted.fitted_values)
--- end

---Error:

TypeError Traceback (most recent call last)
in ()
----> 1 predicted = mod.predict(['value','capital'],data = data[:50] , exog = None)
2 print(predicted)
3 #plt.scatter(data.invest,predicted.fitted_values)
4 #plt.scatter

C:\ProgramData\Anaconda3\lib\site-packages\linearmodels\panel\model.py in predict(self, params, exog, data, eval_env)
683 if params.shape[0] == 1:
684 params = params.T
--> 685 pred = pd.DataFrame(x @ params, index=exog.index, columns=['predictions'])
686
687 return pred

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
---- End
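The traceback shows the signature `predict(self, params, exog, data, eval_env)`, so the first positional argument is read as the parameter vector. Passing the column names `['value','capital']` there means `x @ params` multiplies floats by strings, which would explain the TypeError. The NumPy sketch below mimics only that step; `predict_sketch` is a hypothetical stand-in, not linearmodels code. In the reported script, passing the fitted coefficients (for example `res.params`) in that position is presumably what was intended.

```python
import numpy as np


def predict_sketch(params, exog):
    """Mimic the `x @ params` step from the traceback (hypothetical helper)."""
    params = np.asarray(params)
    if not np.issubdtype(params.dtype, np.number):
        raise TypeError("params must be numeric coefficients, not column names")
    params = np.atleast_2d(params).astype(float)
    if params.shape[0] == 1:
        params = params.T  # column vector, one row per regressor
    return exog @ params


# Columns: constant, x, x squared (illustrative data only)
exog = np.column_stack([np.ones(4), np.arange(4.0), np.arange(4.0) ** 2])
coefs = np.array([1.0, 2.0, 0.5])
pred = predict_sketch(coefs, exog)
```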

Poolability test in PoolOLS

Kevin, I was wondering if linearmodels (or really anything else) implements the standard Chow test for poolability as in http://support.sas.com/documentation/cdl/en/etsug/66840/HTML/default/viewer.htm#etsug_panel_details41.htm.

I see that FixedEffects model has an F-test for effects but was looking for testing whether pooled ols is appropriate or not.

Thanks for writing this wonderful package. It plugs a big hole in statsmodels currently!

Docs Page is not rendering

The docs page is not rendering. I am getting a 404: https://bashtage.github.io/linearmodels/doc

API consistency with statsmodels

I noticed a few inconsistencies with how statsmodels works. For example, the results summary is a method in SM but a property in LM. Is this intentional?

I also see that predict method is missing, but it is implemented for many SM models. Are there plans to implement a predict function? Handling fixed effects (and lagged dependent variables if implemented) would be nice.

FamaMacBeth Std. Errors / T-Stats

I have encountered an issue pertaining to the computation of standard errors and measures dependent on them such as t-stats and p-values in the FamaMacBeth function. On some occasions the function will produce a parameter estimate, but no error statistics. More specifically, I receive a runtime warning of the following sort:

..../linearmodels/panel/results.py:70: RuntimeWarning: invalid value encountered in sqrt

Any suggestions are very welcome :)

Multicollinearity in Panel Data in Python

Hi, I'm new to GitHub, so pardon me if this is not the venue to suggest things for future releases. I think it would be incredibly useful for future users of this package to be able to run models in PanelOLS, or in whatever estimation strategy they want, where the solver automatically drops dummies (and notifies you) when there is perfect multicollinearity in the regressors. This is how Stata operates, and I imagine R has similar functionality. As an example using the data that you provide as test data, I attempt to run a PanelOLS on a fairly simple model, but instead of using expersq I use exper. This clearly results in a singular matrix.

# Load the test data
import statsmodels.api as sm
from linearmodels.datasets import wage_panel
import pandas as pd
data = wage_panel.load()
year = pd.Categorical(data.year)
data = data.set_index(['nr', 'year'])
data['year'] = year
print(wage_panel.DESCR)
print(data.head())

# Run the regression
from linearmodels.panel import PanelOLS
exog_vars = ['exper','union','married']
exog = sm.add_constant(data[exog_vars])
mod = PanelOLS(data.lwage, exog, entity_effects=True, time_effects=True)
fe_te_res = mod.fit()
print(fe_te_res)

Also, you can see a version where I recreate how Stata would estimate the exact same model here: https://stackoverflow.com/questions/55071706/multicollinearity-in-panel-data-in-python
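The kind of check requested above can be sketched in NumPy, assuming a balanced panel stored as an array of shape (entities, periods, regressors). A regressor such as experience that grows by one each year is an entity term plus a time term, so two-way demeaning reduces it to zero, and the same happens to a constant. `absorbed_columns` is a hypothetical helper and not a linearmodels feature; it drops a column when nothing is left of it after demeaning and after regressing on the columns already kept.

```python
import numpy as np


def absorbed_columns(x, names, tol=1e-8):
    """Split regressors into kept and dropped after two-way demeaning."""
    x = np.asarray(x, dtype=float)
    k = x.shape[2]
    # Within transformation for a balanced panel
    within = (
        x
        - x.mean(axis=1, keepdims=True)
        - x.mean(axis=0, keepdims=True)
        + x.mean(axis=(0, 1), keepdims=True)
    )
    flat = within.reshape(-1, k)
    scale = np.linalg.norm(x.reshape(-1, k), axis=0)  # size of the raw columns
    kept, dropped = [], []
    for j, name in enumerate(names):
        resid = flat[:, j]
        if kept:
            basis = flat[:, kept]
            resid = resid - basis @ np.linalg.lstsq(basis, resid, rcond=None)[0]
        if np.linalg.norm(resid) <= tol * max(scale[j], 1.0):
            dropped.append(name)  # nothing left: perfectly collinear
        else:
            kept.append(j)
    return [names[j] for j in kept], dropped


# Simulated balanced panel (illustrative data, not the wage_panel dataset)
rng = np.random.default_rng(0)
n_entities, n_periods = 20, 8
start = rng.integers(0, 10, size=(n_entities, 1))
exper = start + np.arange(n_periods)  # grows by one each period
union = (rng.random((n_entities, n_periods)) > 0.5).astype(float)
married = (rng.random((n_entities, n_periods)) > 0.5).astype(float)
const = np.ones((n_entities, n_periods))

x = np.stack([const, exper, union, married], axis=2)
names = ["const", "exper", "union", "married"]
kept, dropped = absorbed_columns(x, names)
print("kept:", kept, "dropped:", dropped)
```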

Remaining tasks for SystemGMM and 3SLS

Add GMM Estimation for 3SLS

GMM Estimation of IV System models can have some advantages over 3SLS when instruments are not common.

Panel Tasks

Inconsistent Results of Fixed Effects (PanelOLS)

First of all, very much appreciate this great package for panel data analysis in Python!

I'm doing panel regression with fixed effects of entity and time. By using the same dataset and running through R, Python Statsmodels, and Python linearmodels (this package), the results of R and Statsmodels (using dummies) are consistent, but they both differ from linearmodels (PanelOLS, I've tried unadjusted, robust, etc.). Why?

I'm a beginner in econometrics, so I guess you may need more information about the analysis (or maybe I got something wrong!). We could also experiment on a public dataset. Let me know how we can help to improve the robustness of this package. Thanks!

Could linearmodels handle stream data?

When I use PanelOLS to run a regression like:

mod = PanelOLS(data[depv], data[indv])

I get the error about insufficient memory.

Then I would like to use Spark to handle the problem.

But I do not know whether linearmodels could handle stream data.

Work items for panel

SUR with three-dimensional exog

Thanks @bashtage for introducing me to SUR and linearmodels over at stackoverflow.

I'm having a bit of trouble with using the SUR model in the way I would like to, and I hope you can help. I'm sorry that this issue has become quite long. I'm happy to ask elsewhere, but if you have the time, I'd be very happy to hear your thoughts.

I'm trying to fit a number of linear components to a number of spectra. Each spectrum is a one-dimensional array (typically of length 1024) that has been taken by a microscope. The microscope does this across a number of pixels (let's say a grid of 64 by 64) on a sample. So my raw data, my dependent variable, has numpy shape (64,64,1024). Later on in this post I'll take the product of the pixels, changing the shape into (64*64, 1024).

My components can be a variety of things, but for the sake of argument, let's say I'm trying to fit a second order polynomial as a background. So we have three components, x**2, x and 1. I compute each one for the spectrum which results in three arrays, or a ndarray of shape (3,1024).

Normally, I would now either (functions timed with jupyter notebook):
A) loop over the 64x64 pixels, and run np.linalg.lstsq() on each spectrum:

%%timeit
rawdata = np.random.random((64,64,1024)).reshape(64*64,1024)
components = np.random.random((3,1024))

for i in range(rawdata.shape[0]):
    np.linalg.lstsq(components.T, rawdata[i])
664 ms ± 25.7 ms per loop

or B) use the "not-really-designed-for-this" OLS:

%%timeit
rawdata = np.random.random((64,64,1024)).reshape(64*64,1024)
components = np.random.random((3,1024))

sm.OLS(rawdata.T, components.T)
57 ms ± 1.78 ms per loop

OLS is much faster, especially when I increase the number of pixels. That's great.

Unfortunately, my problem is ideally a little bit more complicated. In addition to simple components like x**2, we often want to use Gaussians or similar expressions, but where all the parameters are fixed. However, these components will not always be equal in every pixel. In some pixels the component has been pre-fitted in some manner; perhaps the centre of the Gaussian is shifted a little to the right or to the left.

This means that I have increased the number of dimensions of my components so my new component shape is now (64,64,3,1024). Around now I think you'll be seeing my problem. Trying to feed my component data as the exog variable, I get the ValueError: exog_0 has too many dims. Maximum is 2, actual is 3.

This leads me to think that I've got the wrong model, or at least a problem which is not suitable to be solved in this manner.

I should add that I wasn't able to fit the first case (like I do with numpy and statsmodels) with SUR - I'm not sure where I go wrong, but it's definitely my syntax:

from linearmodels import SUR
import numpy as np

rawdata = np.random.random((64,64,1024)).reshape(64*64,1024)
components = np.random.random((3,1024))

equations = {
    'test':{
        'dependent':rawdata,
        'exog':components
    }
}

SUR(equations)
ValueError: Array required to have 4096 obs, has 3

Thanks for reading. Sorry that the post became so long, but I feared giving too little information, and since we work in very different fields and with very different "languages", I thought it better to fully explain. If you have any suggestions or advice, I'll gladly take them.
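For the per-pixel components described in this post, one option outside SUR is to solve all the small least-squares problems at once with stacked normal equations. This is a plain NumPy sketch under the assumption that each pixel has its own full-rank (3, channels) component matrix; the array names and sizes are illustrative. Normal equations are less numerically robust than `lstsq` when components are nearly collinear.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_comp, n_chan = 16, 3, 128  # small sizes for illustration
spectra = rng.random((n_pixels, n_chan))
components = rng.random((n_pixels, n_comp, n_chan))  # differ per pixel

# Per-pixel Gram matrices X X' and right-hand sides X y
gram = np.einsum("pic,pjc->pij", components, components)
rhs = np.einsum("pic,pc->pi", components, spectra)

# np.linalg.solve broadcasts over the leading (pixel) dimension
coefs = np.linalg.solve(gram, rhs[..., None])[..., 0]  # (n_pixels, n_comp)

# When every pixel shares the same components, one lstsq call is enough
shared = components[0]
shared_coefs = np.linalg.lstsq(shared.T, spectra.T, rcond=None)[0].T
```

With identical regressors in every equation and no cross-equation restrictions, SUR point estimates coincide with equation-by-equation OLS, so the shared-components case gains nothing from a SUR fit beyond the residual covariance.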

How to get model parameters of linearmodels?

https://stackoverflow.com/questions/47645349/how-to-get-model-parameters-of-linearmodels

BUG: Incorrect weight calculation for Parzen/Gallant

Wrongly placed parentheses in parzen weights. Was

w[z > 0.5] = 2 * (1 - z[z > 0.5] ** 3)

should be

w[z > 0.5] = 2 * (1 - z[z > 0.5]) ** 3

From #168
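The corrected line can be placed in context with a short sketch of the Parzen kernel, w(z) = 1 - 6z^2 + 6z^3 for z <= 0.5 and 2(1 - z)^3 otherwise. The function name `parzen_weights` and the choice z = j / (bandwidth + 1) are assumptions made for illustration, not necessarily the library's exact code.

```python
import numpy as np


def parzen_weights(bandwidth):
    """Parzen kernel weights for lags 0..bandwidth (hypothetical helper)."""
    z = np.arange(bandwidth + 1) / (bandwidth + 1)
    w = 1 - 6 * z**2 + 6 * z**3
    upper = z > 0.5
    # Cube the difference (1 - z), not z alone
    w[upper] = 2 * (1 - z[upper]) ** 3
    return w


weights = parzen_weights(9)
```

With the misplaced parenthesis the upper branch evaluates to 2 * (1 - z**3), which exceeds 1 just above z = 0.5, so the weights would jump upward instead of decaying smoothly toward zero.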

Clustered variance on time ... something is not working right ...

Hi -

I was trying to replicate some control results regarding the calculation of clustered standard errors in the PanelOLS class in the linearmodels library. Specifically, I was replicating the control results created by Mitchell A. Petersen, made public here:

https://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.htm

Everything looks fine, except that the results from the regression using standard errors clustered by time are a bit off, while the other results match.

Can you please double check whether the calculations for standard errors clustered on time are right?

I assume that I am doing the right things as I was able to replicate all other regression results.

Running Python 3 and just downloaded LinearModels 4.10 on my Windows computer.

Thank you in advance.

Kind regards, Jesper.

biprobit or similar for binary endogenous explanatory variable

Hi,

I've asked the same question on pystatsmodels forum, sorry if it is inconvenient to use the issue tracker here for the purpose of this question.

I have a model similar to what is described in Wooldridge (Econometric Analysis of Cross Section and Panel Data, 2nd ed.) 15.7.3 (eqs. 15.51 and 15.52). There it is said that a bivariate probit model can be used to estimate parameters. Is this or something similar implemented in linearmodels, or are you planning to implement it? e.g. systems probit/logit

Best
