Giter Site home page Giter Site logo

s3alfisc / pyfixest Goto Github PK

View Code? Open in Web Editor NEW
93.0 4.0 13.0 13.85 MB

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax

Home Page: https://py-econometrics.github.io/pyfixest/pyfixest.html

License: MIT License

Python 74.60% R 0.94% Jupyter Notebook 24.20% Just 0.26%

pyfixest's Introduction

PyFixest: Fast High-Dimensional Fixed Effects Regression in Python

PyPI - Version PyPI - Python Version PyPI - Downloads image All Contributors

PyFixest is a Python implementation of the formidable fixest package for fast high-dimensional fixed effects regression.

The package aims to mimic fixest syntax and functionality as closely as Python allows: if you know fixest well, the goal is that you won't have to read the docs to get started! In particular, this means that all of fixest's defaults are mirrored by PyFixest - currently with only one small exception.

Nevertheless, for a quick introduction, you can take a look at the documentation or the regression chapter of Arthur Turrell's book on Coding for Economists.

Features

  • OLS, WLS and IV Regression
  • Poisson Regression following the pplmhdfe algorithm
  • Multiple Estimation Syntax
  • Several Robust and Cluster Robust Variance-Covariance Estimators
  • Wild Cluster Bootstrap Inference (via wildboottest)
  • Difference-in-Differences Estimators:
  • Multiple Hypothesis Corrections following the Procedure by Romano and Wolf and Simultaneous Confidence Intervals using a Multiplier Bootstrap
  • Fast Randomization Inference as in the ritest Stata package
  • The Causal Cluster Variance Estimator (CCV) following Abadie et al.

Installation

You can install the release version from PyPi by running

pip install -U pyfixest

or the development version from github by running

pip install git+https://github.com/py-econometrics/pyfixest.git

Benchmarks

All benchmarks follow the fixest benchmarks. All non-pyfixest timings are taken from the fixest benchmarks.

Quickstart

import pyfixest as pf

data = pf.get_data()
pf.feols("Y ~ X1 | f1 + f2", data=data).summary()
###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.919 |        0.065 |   -14.057 |      0.000 | -1.053 |  -0.786 |
---
RMSE: 1.441   R2: 0.609   R2 Within: 0.2

Multiple Estimation

You can estimate multiple models at once by using multiple estimation syntax:

# OLS Estimation: estimate multiple models at once
fit = pf.feols("Y + Y2 ~X1 | csw0(f1, f2)", data = data, vcov = {'CRV1':'group_id'})
# Print the results
fit.etable()
                           est1               est2               est3               est4               est5               est6
------------  -----------------  -----------------  -----------------  -----------------  -----------------  -----------------
depvar                        Y                 Y2                  Y                 Y2                  Y                 Y2
------------------------------------------------------------------------------------------------------------------------------
Intercept      0.919*** (0.121)   1.064*** (0.232)
X1            -1.000*** (0.117)  -1.322*** (0.211)  -0.949*** (0.087)  -1.266*** (0.212)  -0.919*** (0.069)  -1.228*** (0.194)
------------------------------------------------------------------------------------------------------------------------------
f2                            -                  -                  -                  -                  x                  x
f1                            -                  -                  x                  x                  x                  x
------------------------------------------------------------------------------------------------------------------------------
R2                        0.123              0.037              0.437              0.115              0.609              0.168
S.E. type          by: group_id       by: group_id       by: group_id       by: group_id       by: group_id       by: group_id
Observations                998                999                997                998                997                998
------------------------------------------------------------------------------------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)

Adjust Standard Errors "on-the-fly"

Standard Errors can be adjusted after estimation, "on-the-fly":

fit1 = fit.fetch_model(0)
fit1.vcov("hetero").summary()
Model:  Y~X1
###

Estimation:  OLS
Dep. var.: Y
Inference:  hetero
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.919 |        0.112 |     8.223 |      0.000 |  0.699 |   1.138 |
| X1            |     -1.000 |        0.082 |   -12.134 |      0.000 | -1.162 |  -0.838 |
---
RMSE: 2.158   R2: 0.123

Poisson Regression via fepois()

You can estimate Poisson Regressions via the fepois() function:

poisson_data = pf.get_data(model = "Fepois")
pf.fepois("Y ~ X1 + X2 | f1 + f2", data = poisson_data).summary()
###

Estimation:  Poisson
Dep. var.: Y, Fixed effects: f1+f2
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.007 |        0.035 |    -0.190 |      0.850 | -0.075 |   0.062 |
| X2            |     -0.015 |        0.010 |    -1.449 |      0.147 | -0.035 |   0.005 |
---
Deviance: 1068.169

IV Estimation via three-part formulas

Last, PyFixest also supports IV estimation via three part formula syntax:

fit_iv = pf.feols("Y ~ 1 | f1 | X1 ~ Z1", data = data)
fit_iv.summary()
###

Estimation:  IV
Dep. var.: Y, Fixed effects: f1
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -1.025 |        0.115 |    -8.930 |      0.000 | -1.259 |  -0.790 |
---

Call for Contributions

Thanks for showing interest in contributing to pyfixest! We appreciate all contributions and constructive feedback, whether that be reporting bugs, requesting new features, or suggesting improvements to documentation.

If you'd like to get involved, but are not yet sure how, please feel free to send us an email. Some familiarity with either Python or econometrics will help, but you really don't need to be a numpy core developer or have published in Econometrica =) We'd be more than happy to invest time to help you get started!

Contributors โœจ

Thanks goes to these wonderful people:

styfenschaer
styfenschaer

๐Ÿ’ป
Niall Keleher
Niall Keleher

๐Ÿš‡ ๐Ÿ’ป
Wenzhi Ding
Wenzhi Ding

๐Ÿ’ป
Apoorva Lal
Apoorva Lal

๐Ÿ’ป ๐Ÿ›
Juan Orduz
Juan Orduz

๐Ÿš‡ ๐Ÿ’ป
Alexander Fischer
Alexander Fischer

๐Ÿ’ป ๐Ÿš‡
aeturrell
aeturrell

โœ… ๐Ÿ“– ๐Ÿ“ฃ

This project follows the all-contributors specification. Contributions of any kind welcome!

pyfixest's People

Contributors

allcontributors[bot] avatar apoorvalal avatar dependabot[bot] avatar juanitorduz avatar nkeleher avatar pre-commit-ci[bot] avatar s3alfisc avatar styfenschaer avatar wenzhi-ding avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

pyfixest's Issues

p-value and some t-stat calculations differ from R fixest

Some more toying around, and I'm getting different p-value calculations from the original fixest, even in cases where estimates and t-stats match:

With clustering, the t-stats match but p-values are different:

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf

res = restaurant_inspections.load_pandas().data

fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'Year'}))
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: inspection_score 
# 
#            Estimate  Std. Error    t value  Pr(>|t|)
# Intercept 93.627262    0.321288 291.412047  0.000000
#   Weekend  2.096548    0.862544   2.430656  0.015072
library(fixest)
library(causaldata)

data(restaurant_inspections)

feols(inspection_score ~ Weekend , data = restaurant_inspections, vcov = ~Year)

# OLS estimation, Dep. Var.: inspection_score
# Observations: 27,178 
# Standard-errors: Clustered (Year) 
# Estimate Std. Error   t value  Pr(>|t|)    
# (Intercept) 93.62726   0.321288 291.41205 < 2.2e-16 ***
#   WeekendTRUE  2.09655   0.862544   2.43066  0.029106 *  
#   ---
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 6.25408   Adj. R2: 8.241e-4

With IID and HC1, the t-stats and p-values differ by very small amounts

fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'iid')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year 
# 
# Estimate  Std. Error      t value  Pr(>|t|)
# Intercept 2010.343815    0.036224 55497.059216  0.000000
# Weekend   -0.862863    0.412097    -2.093833  0.036275
feols(Year~Weekend, data = restaurant_inspections)
OLS estimation, Dep. Var.: Year
Observations: 27,178 
Standard-errors: IID 
               Estimate Std. Error     t value  Pr(>|t|)    
(Intercept) 2010.343815   0.036226 55495.01719 < 2.2e-16 ***
WeekendTRUE   -0.862863   0.412112    -2.09376  0.036291 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 5.94874   Adj. R2: 1.245e-4
fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'HC1')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year 
# 
# Estimate  Std. Error      t value  Pr(>|t|)
# Intercept 2010.343815    0.036181 55562.889684  0.000000
# Weekend   -0.862863    0.471032    -1.831856  0.066973
feols(Year~Weekend, data = restaurant_inspections, vcov = 'hc1')
# OLS estimation, Dep. Var.: Year
# Observations: 27,178 
# Standard-errors: Heteroskedasticity-robust 
# Estimate Std. Error     t value  Pr(>|t|)    
# (Intercept) 2010.343815   0.036182 55561.86747 < 2.2e-16 ***
#   WeekendTRUE   -0.862863   0.471041    -1.83182  0.066989 .  

Unit Tests

  • test feols front end against fixest::feols
  • test parsing of formula syntax
  • add continuous integration
  • add even more tests

Bug for interactions

KeyError: "['X1:X2'] not in index". Happens because variable "X1:X2" is not contained in the input data. Likely requires even more reshuffling of the formula parsing - demeaning - model matrix pipeline.

Support for datetime variables

Was inspired to try using a datetime in a regression, given that statsmodels handles these incorrectly. Currently these do not appear to be supported in pyfixest

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf

res = restaurant_inspections.load_pandas().data

res['DT'] = [pd.to_datetime(str(y)+'-01-01 00:00:00') for y in res['Year']]

fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ DT', vcov = 'iid')
fixest.summary()
# UFuncTypeError: ufunc 'multiply' cannot use operands with types dtype('int32') and dtype('<M8[ns]')

Note that this also occurs without the 00:00:00, or if using a datetime.date() instead of a Pandas datetime. Compare to R:

library(fixest)
library(causaldata)

data(restaurant_inspections)

restaurant_inspections$Time = lubridate::ymd_hms(paste0(restaurant_inspections$Year, '-01-01 00:00:00'))

feols(Year~Time, data = restaurant_inspections)
# OLS estimation, Dep. Var.: Year
# Observations: 27,178 
# Standard-errors: IID 
# Estimate   Std. Error  t value  Pr(>|t|)    
# (Intercept) 1.970001e+03 3.323279e-05 59278828 < 2.2e-16 ***
#   Time        3.170000e-08 2.580000e-14  1226887 < 2.2e-16 ***
#   ---
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 7.994e-4   Adj. R2: 1

That's probably enough tinkering for me today! Sorry to load up the Issue page.

Tests

  • compare against fixest
  • internal benchmarks

vcov() method does not update summary info

Example:

import pyfixest as pf
import numpy as np
from pyfixest.utils import get_data

fixest = pf.Fixest(data = data)
fixest.feols("Y~X1 | csw0(X2, X3)", vcov = {'CRV1':'id'})
fixest.summary()
# ###
# 
# ---
# ###
# 
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#            Estimate  Std. Error   t value  Pr(>|t|)
# Intercept  6.648203    0.220649 30.130262   0.00000
#        X1 -0.141200    0.211081 -0.668937   0.50369
# ---
# ###
# 
# Fixed effects:  X2
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.142274    0.210556 -0.675707  0.499383
# ---
# ###
# 
# Fixed effects:  X2+X3
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.096317    0.204801 -0.470296  0.638247
fixest.vcov({'CRV3':'group_id'}).summary()
>>> fixest.vcov({'CRV3':'group_id'}).summary()
# ###
# 
# ---
# ###
# 
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#            Estimate  Std. Error   t value  Pr(>|t|)
# Intercept  6.648203    0.229614 28.953831  0.000000
#        X1 -0.141200    0.207516 -0.680428  0.502745
# ---
# ###
# 
# Fixed effects:  X2
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.142274     0.20774 -0.684867   0.49999
# ---
# ###
# 
# Fixed effects:  X2+X3
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error  t value  Pr(>|t|)
# X1 -0.096317    0.206282 -0.46692  0.644768
# 

CRV3 inference

  • currently blocked by default -> unblock
  • update with code from statsmodels PR

Add requirements.txt

Currently there is no indication of what dependencies are required for the package. At the moment I'm repeatedly trying to load Fixest and installing the missing packages it notices one by one.

Implement CRV3 for arbitrary fixed effects

Should be fairly straightforward: for no clustering, run MNW's summclust algo, else do what sandwich does. E.g. do along these lines:

                if self.has_fixef == False:
                    # inverse hessian precomputed?
                    tXX = np.transpose(self.X) @ self.X
                    tXy = np.transpose(self.X) @ self.Y

                    # compute leave-one-out regression coefficients (aka clusterjacks')
                    for ixg, g in enumerate(clusters):

                        Xg = self.X[np.equal(ixg, group)]
                        Yg = self.Y[np.equal(ixg, group)]
                        tXgXg = np.transpose(Xg) @ Xg
                        # jackknife regression coefficient
                        beta_jack[ixg,:] = (
                            np.linalg.pinv(tXX - tXgXg) @ (tXy - np.transpose(Xg) @ Yg)
                        ).flatten()

                else:

                    for ixg, g in enumerate(clusters):
                        data = self.data[np.equal(ixg, group)]
                        model = Fixest(data)
                        model.feols(self.formula, vcov = "iid")
                        beta_jack[ixg,:] = model.beta_hat

Issues with specifying vcov

  1. Currently, vcov is a required positional argument in feols. Note this also means you can't specify the vcov using the .vcov method without also specifying it in the feols call as a positional argument. Suggest matching the defaults from the R package of vcov = 'iid' with no fixed effects specified, or clustered at the level of the first fixed effect if fixed effects are specified., or overwriting that with whatever was set in .vcov.
  2. help(fixest.feols) suggests using dict("CRV1":"clustervar") to specify a cluster variable but this is improper syntax.
  3. In my example data I'm having trouble setting any sort of clustering variable alongside a set of fixed effects (pip install causaldata; note there are no missing values in this data):
from causaldata import restaurant_inspections
import pandas as pd
import pyfixest as pf
import numpy as np

res = restaurant_inspections.load_pandas().data
fixest = pf.Fixest(data = res)

fixest.feols('inspection_score ~ Weekend | business_name', vcov = 'iid')
# runs fine

fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

# maybe it prefers numbers?
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'Year'}))
# IndexError: index 27134 is out of bounds for axis 0 with size 27076

# Or integers?
res['random_category'] = np.random.randint(0, 10, res.shape[0])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'random_category'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076

# or categoricals?
res['categorical'] = pd.Categorical(res['random_category'])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'categorical'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076

# Works fine without the FEs
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'categorical'}))

# although still not for business_name
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''


# Maybe it can't handle single-row FEs alongside clusters?
res['counts'] = (res
		   .groupby('business_name')['business_name']
		   .transform('size'))
fixest = pf.Fixest(data = res.query('counts > 1'))
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# ValueError: matrix should have the same number of rows as fixed effect IDs.

Bug in i() method

For some reason, signs of integers are swapped, i.e. for event studies with time-to-treatment (ttt) -27, ...., 0, ..., 21, the estimated coefs are labeled as -21, ..., 0, ..., 27.

Custom fixed effects demeaning function

  • Write a custom demean() function that allows for NA values in either X or Y
  • further, allow for weights (for WLS)
  • implement dropping of singletons
  • note that there are a few errors in the current version of demean() (i.e. the handling of missings + multiple fixed effects)

Enable cluster jackknife inference

Fairly straightforward:

  • simply add clustering variant CRV3-jackknife, e.g. as function argument {'CRV3-jackknife':'group_id'}
  • uncomment code here
  • add tests, done!

Mimic fixest vcov defaults

  • drop requirement to set vcov argument in .feols()
  • for no specified vcov without fixed effects: iid inference
  • for no specified vcov with fixed effects: CRV1 clustered at the first fixed effects level

See here.

Handling of Categorical Variables

  • Build around formulaic's C(..., levels = ...) option.
  • Need to pass levels information to formula.
  • Rename levels to "ref" so that users can only provide a reference str instead of a full list.
  • Is there a need for a dedicated i() option to interact with a categorical variable? The key advantage of the i() method seems custom tooling around it, e.g. iplot. But that could be handled with plotting method with decent string regex for coef names (i.e. for variables included in i()). Maybe not 100% error prone?

IV support

Required Steps:

  • allow three-part formulas
  • implement IV estimation & inference. Start with just-identified models.
  • update summary() and tidy() methods. For tidy(), simply add index "stage1" and "stage2". Split summary() into two parts. Check out what fixest does.

Nice to have's (for now):

  • common IV diagnostics
  • AR confidence intervals

In Practice:

  • implement IV estimator (Z'X)^(-1) Z'Y.
  • always create X and Z. For OLS, set X = Z
  • pass everything through, done (at least for models with only one endog variable)
  • after this is implemented, allow for over identified models. This requires implementation of the 2SLS estimator + updates to inference procedures

Improve performance

  • See here for faster solving of the least squares problem.
  • faster demeaning algo via numba / c++
  • clean up code, drop redundancies, etc
  • for a given column in the design matrix, only project out fe's when not already done (for multiple estimation)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.