s3alfisc / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax

Home Page: https://py-econometrics.github.io/pyfixest/pyfixest.html

License: MIT License

Python 74.60% R 0.94% Jupyter Notebook 24.20% Just 0.26%

pyfixest's Introduction

PyFixest: Fast High-Dimensional Fixed Effects Regression in Python

PyFixest is a Python implementation of the formidable fixest package for fast high-dimensional fixed effects regression.

The package aims to mimic fixest syntax and functionality as closely as Python allows: if you know fixest well, the goal is that you won't have to read the docs to get started! In particular, this means that all of fixest's defaults are mirrored by PyFixest - currently with only one small exception.

Nevertheless, for a quick introduction, you can take a look at the documentation or the regression chapter of Arthur Turrell's book on Coding for Economists.

Features

OLS, WLS and IV Regression
Poisson Regression following the pplmhdfe algorithm
Multiple Estimation Syntax
Several Robust and Cluster Robust Variance-Covariance Estimators
Wild Cluster Bootstrap Inference (via wildboottest)
Difference-in-Differences Estimators:
- The canonical Two-Way Fixed Effects Estimator
- Gardner's two-stage ("Did2s") estimator
- Basic Versions of the Local Projections estimator following Dube et al (2023)
Multiple Hypothesis Corrections following the Procedure by Romano and Wolf and Simultaneous Confidence Intervals using a Multiplier Bootstrap
Fast Randomization Inference as in the ritest Stata package
The Causal Cluster Variance Estimator (CCV) following Abadie et al.

Installation

You can install the release version from PyPI by running

pip install -U pyfixest

or the development version from GitHub by running

pip install git+https://github.com/py-econometrics/pyfixest.git

Benchmarks

All benchmarks follow the fixest benchmarks. All non-pyfixest timings are taken from the fixest benchmarks.

Quickstart

import pyfixest as pf

data = pf.get_data()
pf.feols("Y ~ X1 | f1 + f2", data=data).summary()

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.919 |        0.065 |   -14.057 |      0.000 | -1.053 |  -0.786 |
---
RMSE: 1.441   R2: 0.609   R2 Within: 0.2

Multiple Estimation

You can estimate multiple models at once by using multiple estimation syntax:

# OLS Estimation: estimate multiple models at once
fit = pf.feols("Y + Y2 ~X1 | csw0(f1, f2)", data = data, vcov = {'CRV1':'group_id'})
# Print the results
fit.etable()

                           est1               est2               est3               est4               est5               est6
------------  -----------------  -----------------  -----------------  -----------------  -----------------  -----------------
depvar                        Y                 Y2                  Y                 Y2                  Y                 Y2
------------------------------------------------------------------------------------------------------------------------------
Intercept      0.919*** (0.121)   1.064*** (0.232)
X1            -1.000*** (0.117)  -1.322*** (0.211)  -0.949*** (0.087)  -1.266*** (0.212)  -0.919*** (0.069)  -1.228*** (0.194)
------------------------------------------------------------------------------------------------------------------------------
f2                            -                  -                  -                  -                  x                  x
f1                            -                  -                  x                  x                  x                  x
------------------------------------------------------------------------------------------------------------------------------
R2                        0.123              0.037              0.437              0.115              0.609              0.168
S.E. type          by: group_id       by: group_id       by: group_id       by: group_id       by: group_id       by: group_id
Observations                998                999                997                998                997                998
------------------------------------------------------------------------------------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)

Adjust Standard Errors "on-the-fly"

Standard Errors can be adjusted after estimation, "on-the-fly":

fit1 = fit.fetch_model(0)
fit1.vcov("hetero").summary()

Model:  Y~X1
###

Estimation:  OLS
Dep. var.: Y
Inference:  hetero
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.919 |        0.112 |     8.223 |      0.000 |  0.699 |   1.138 |
| X1            |     -1.000 |        0.082 |   -12.134 |      0.000 | -1.162 |  -0.838 |
---
RMSE: 2.158   R2: 0.123

Poisson Regression via `fepois()`

You can estimate Poisson Regressions via the fepois() function:

poisson_data = pf.get_data(model = "Fepois")
pf.fepois("Y ~ X1 + X2 | f1 + f2", data = poisson_data).summary()

###

Estimation:  Poisson
Dep. var.: Y, Fixed effects: f1+f2
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.007 |        0.035 |    -0.190 |      0.850 | -0.075 |   0.062 |
| X2            |     -0.015 |        0.010 |    -1.449 |      0.147 | -0.035 |   0.005 |
---
Deviance: 1068.169

IV Estimation via three-part formulas

Last, PyFixest also supports IV estimation via three-part formula syntax:

fit_iv = pf.feols("Y ~ 1 | f1 | X1 ~ Z1", data = data)
fit_iv.summary()

###

Estimation:  IV
Dep. var.: Y, Fixed effects: f1
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -1.025 |        0.115 |    -8.930 |      0.000 | -1.259 |  -0.790 |
---

Call for Contributions

Thanks for showing interest in contributing to pyfixest! We appreciate all contributions and constructive feedback, whether that be reporting bugs, requesting new features, or suggesting improvements to documentation.

If you'd like to get involved, but are not yet sure how, please feel free to send us an email. Some familiarity with either Python or econometrics will help, but you really don't need to be a numpy core developer or have published in Econometrica =) We'd be more than happy to invest time to help you get started!

Contributors ✨

Thanks goes to these wonderful people:

styfenschaer
💻

Niall Keleher
🚇 💻

Wenzhi Ding
💻

Apoorva Lal
💻 🐛

Juan Orduz
🚇 💻

Alexander Fischer
💻 🚇

aeturrell
✅ 📖 📣

This project follows the all-contributors specification. Contributions of any kind welcome!

pyfixest's Issues

Replace np.isnan with pd.isna

... when checking for NA values in cluster variable.
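
A minimal illustration of why the switch matters, assuming the cluster variable arrives as an object-dtype column (for example strings). The column contents are made up for this sketch and are not taken from the package:

```python
import numpy as np
import pandas as pd

# Hypothetical cluster column holding strings and one missing value.
cluster = pd.Series(["a", "b", None, "a"], dtype=object)

# np.isnan only understands numeric dtypes and raises on object arrays.
try:
    np.isnan(cluster.to_numpy())
    isnan_works = True
except TypeError:
    isnan_works = False

# pd.isna accepts object, string, categorical and numeric data alike.
na_mask = pd.isna(cluster).to_numpy()
```

The `TypeError` caught here is the same kind of error that shows up in the clustering reports further down this page when `business_name` is used as the cluster variable.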

p-value and some t-stat calculations differ from R fixest

Some more toying around, and I'm getting different p-value calculations from the original fixest, even in cases where estimates and t-stats match:

With clustering, the t-stats match but p-values are different:

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf

res = restaurant_inspections.load_pandas().data

fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'Year'}))
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: inspection_score 
# 
#            Estimate  Std. Error    t value  Pr(>|t|)
# Intercept 93.627262    0.321288 291.412047  0.000000
#   Weekend  2.096548    0.862544   2.430656  0.015072

library(fixest)
library(causaldata)

data(restaurant_inspections)

feols(inspection_score ~ Weekend , data = restaurant_inspections, vcov = ~Year)

# OLS estimation, Dep. Var.: inspection_score
# Observations: 27,178 
# Standard-errors: Clustered (Year) 
# Estimate Std. Error   t value  Pr(>|t|)    
# (Intercept) 93.62726   0.321288 291.41205 < 2.2e-16 ***
#   WeekendTRUE  2.09655   0.862544   2.43066  0.029106 *  
#   ---
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 6.25408   Adj. R2: 8.241e-4

With IID and HC1, the t-stats and p-values differ by very small amounts

fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'iid')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year 
# 
# Estimate  Std. Error      t value  Pr(>|t|)
# Intercept 2010.343815    0.036224 55497.059216  0.000000
# Weekend   -0.862863    0.412097    -2.093833  0.036275

feols(Year~Weekend, data = restaurant_inspections)
OLS estimation, Dep. Var.: Year
Observations: 27,178 
Standard-errors: IID 
               Estimate Std. Error     t value  Pr(>|t|)    
(Intercept) 2010.343815   0.036226 55495.01719 < 2.2e-16 ***
WeekendTRUE   -0.862863   0.412112    -2.09376  0.036291 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 5.94874   Adj. R2: 1.245e-4

fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'HC1')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year 
# 
# Estimate  Std. Error      t value  Pr(>|t|)
# Intercept 2010.343815    0.036181 55562.889684  0.000000
# Weekend   -0.862863    0.471032    -1.831856  0.066973

feols(Year~Weekend, data = restaurant_inspections, vcov = 'hc1')
# OLS estimation, Dep. Var.: Year
# Observations: 27,178 
# Standard-errors: Heteroskedasticity-robust 
# Estimate Std. Error     t value  Pr(>|t|)    
# (Intercept) 2010.343815   0.036182 55561.86747 < 2.2e-16 ***
#   WeekendTRUE   -0.862863   0.471041    -1.83182  0.066989 .

Reproduce applied econometrics textbook examples in notebooks

Motivation:

uncovers potentially needed functionality
makes sure that pyfixest actually runs for "real world" problems and not simply my synthetic test datasets

Textbooks with code:

the effect (data in the causaldata package)
the mixtape (data also included in the causaldata package)
Data Analysis Business & Econ (github repo)

Implement ssc() function

start with copying fwildclusterboot::boot_ssc()
move on to all features of fixest::ssc()

P-values for clustered errors based on t(G-1) distribution

Instead of normal.
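
A short sketch of the difference between the two reference distributions, using `scipy.stats`. The helper name `two_sided_pvalue` and the number of clusters `G = 10` are assumptions made for illustration; the actual number of clusters in the example above is not stated on this page.

```python
from scipy import stats

def two_sided_pvalue(t_stat, n_clusters=None):
    """Two-sided p-value: normal if n_clusters is None, else t(G - 1)."""
    if n_clusters is None:
        return 2 * stats.norm.sf(abs(t_stat))
    return 2 * stats.t.sf(abs(t_stat), df=n_clusters - 1)

t_stat = 2.430656  # t value for Weekend in the clustered example above
p_normal = two_sided_pvalue(t_stat)
p_t = two_sided_pvalue(t_stat, n_clusters=10)  # G = 10 is purely illustrative
```

Because the t distribution has heavier tails, the t(G - 1) p-value is always larger than the normal one for the same statistic, which matches the direction of the discrepancy reported against R.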

Unit Tests

test feols front end against fixest::feols
test parsing of formula syntax
add continuous integration
add even more tests

General quality improvements

Improve:

the documentation
error messages
performance
test untested functionality

Use the rOpenSci statistical standards as a reference.

User friendly error handling

For common errors, raise user friendly error messages.

HC3 and CRV3 do not seem to match

investigate and fix
tests are already added

Check for multicollinearity

After fixed effects are projected out, check if the resulting design matrix Xtilde is singular.
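
One way such a check could look, sketched with plain NumPy for a single fixed effect. The helper names `demean_by_group` and `is_singular` and the tolerance are assumptions of this sketch, not functions of the package:

```python
import numpy as np

def demean_by_group(X, groups):
    """Subtract group means from every column (one fixed effect only)."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    X_tilde = X.copy()
    for g in np.unique(groups):
        mask = groups == g
        X_tilde[mask] -= X[mask].mean(axis=0)
    return X_tilde

def is_singular(X_tilde, tol=1e-10):
    """True if the demeaned design matrix lacks full column rank."""
    return np.linalg.matrix_rank(X_tilde, tol=tol) < X_tilde.shape[1]

groups = np.array([0, 0, 1, 1, 2, 2])
x1 = np.array([1.0, 2.0, 3.0, 5.0, 4.0, 8.0])
x2 = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
x_const = np.array([7.0, 7.0, 9.0, 9.0, 2.0, 2.0])  # constant within each group

full_rank_case = is_singular(demean_by_group(np.column_stack([x1, x2]), groups))
collinear_case = is_singular(demean_by_group(np.column_stack([x1, x_const]), groups))
```

A regressor that is constant within every fixed effect level becomes a column of zeros after demeaning, so the second design matrix is flagged as singular while the first is not.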

Bug for interactions

KeyError: "['X1:X2'] not in index". Happens because variable "X1:X2" is not contained in the input data. Likely requires even more reshuffling of the formula parsing - demeaning - model matrix pipeline.

Pass NAs to vcov() to use for clustered standard errors

pass NAs to cluster variable
raise exception when there is an additional missing variable in the clustering variable (i.e. when the clustering variable is not part of the model)

Multiple estimations: add 'split' option

Simply add one more for loop ;)

Additional methods

tidy()
summary()
etable()
iplot() (after implementing i())

Singleton fixed effects

Add message when data is dropped due to singleton fixed effects
add tests with singleton fixed effects
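
A minimal pure-Python sketch of singleton detection with a user-facing message. The function name `singleton_mask` and the toy data are hypothetical. Rows are dropped repeatedly, since removing a singleton in one fixed effect can create a new singleton in another:

```python
from collections import Counter

def singleton_mask(*fixed_effects):
    """Return a list of booleans, True for rows that are kept."""
    n = len(fixed_effects[0])
    keep = [True] * n
    changed = True
    while changed:
        changed = False
        for fe in fixed_effects:
            # Count each level among the rows that are still in the sample.
            counts = Counter(level for level, k in zip(fe, keep) if k)
            for i, level in enumerate(fe):
                if keep[i] and counts[level] == 1:
                    keep[i] = False
                    changed = True
    return keep

firm = ["a", "a", "b", "c", "c", "c"]
year = [1, 2, 1, 1, 3, 3]
keep = singleton_mask(firm, year)
n_dropped = keep.count(False)
message = f"{n_dropped} observation(s) dropped due to singleton fixed effects."
```

In this toy data only firm "b" is a singleton at first, but dropping it sets off further rounds, and only the last two rows survive.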

Make sure / check that dependent variables are of numeric types

... as strange things happen when dependent variables have dtype object.
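
A sketch of an early check that fails with a readable message, assuming the data is a pandas DataFrame. The function name `check_depvar_is_numeric` and the column names are made up for illustration:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_depvar_is_numeric(data, depvar):
    """Raise a readable error if the dependent variable is not numeric."""
    if not is_numeric_dtype(data[depvar]):
        raise TypeError(
            f"Dependent variable '{depvar}' has dtype {data[depvar].dtype}, "
            "but a numeric dtype is required."
        )

df = pd.DataFrame({"Y": [1.0, 2.5, 3.0], "Y_obj": ["1.0", "2.5", "3.0"]})
check_depvar_is_numeric(df, "Y")  # passes silently

try:
    check_depvar_is_numeric(df, "Y_obj")
    raised = False
except TypeError:
    raised = True
```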

Add multiple variance-covariance matrices

Bug for i() formula without fixed effects

I.e. for fml = "Y ~ i(X2, X1)". No problems for formulas with fixed effects.

Multiple Estimations and Formula Syntax

Match fixest standard errors exactly

Status quo: minor deviations due to small sample adjustments.

Required: implementation of ssc() function with all fixed effect options.

Drop singleton fixed effects

Support for datetime variables

Was inspired to try using a datetime in a regression, given that statsmodels handles these incorrectly. Currently these do not appear to be supported in pyfixest

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf

res = restaurant_inspections.load_pandas().data

res['DT'] = [pd.to_datetime(str(y)+'-01-01 00:00:00') for y in res['Year']]

fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ DT', vcov = 'iid')
fixest.summary()
# UFuncTypeError: ufunc 'multiply' cannot use operands with types dtype('int32') and dtype('<M8[ns]')

Note that this also occurs without the 00:00:00, or if using a datetime.date() instead of a Pandas datetime. Compare to R:

library(fixest)
library(causaldata)

data(restaurant_inspections)

restaurant_inspections$Time = lubridate::ymd_hms(paste0(restaurant_inspections$Year, '-01-01 00:00:00'))

feols(Year~Time, data = restaurant_inspections)
# OLS estimation, Dep. Var.: Year
# Observations: 27,178 
# Standard-errors: IID 
# Estimate   Std. Error  t value  Pr(>|t|)    
# (Intercept) 1.970001e+03 3.323279e-05 59278828 < 2.2e-16 ***
#   Time        3.170000e-08 2.580000e-14  1226887 < 2.2e-16 ***
#   ---
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 7.994e-4   Adj. R2: 1

That's probably enough tinkering for me today! Sorry to load up the Issue page.

Add support for two-way clustering

Simply apply vcov() cluster three times: vcov_a + vcov_b - vcov_ab.
For ssc, see here.
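
The combination rule above can be sketched in NumPy as follows. This is an illustration under simplifying assumptions: OLS without fixed effects and no small sample corrections at all. The function names are hypothetical and not part of the package.

```python
import numpy as np

def cluster_meat(X, resid, cluster):
    """Sum over clusters of (X_g' u_g)(X_g' u_g)'."""
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(cluster):
        mask = cluster == g
        score = X[mask].T @ resid[mask]
        meat += np.outer(score, score)
    return meat

def oneway_vcov(X, resid, cluster):
    """Cluster robust sandwich estimator without small sample corrections."""
    bread = np.linalg.inv(X.T @ X)
    return bread @ cluster_meat(X, resid, cluster) @ bread

def twoway_vcov(X, resid, cluster_a, cluster_b):
    """vcov_a + vcov_b - vcov_ab, without small sample corrections."""
    # Label each observed (a, b) pair to form the intersection clusters.
    pairs = np.column_stack([cluster_a, cluster_b])
    cluster_ab = np.unique(pairs, axis=0, return_inverse=True)[1].ravel()
    return (
        oneway_vcov(X, resid, cluster_a)
        + oneway_vcov(X, resid, cluster_b)
        - oneway_vcov(X, resid, cluster_ab)
    )

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
cluster_a = rng.integers(0, 10, size=n)
cluster_b = rng.integers(0, 8, size=n)
vcov_2way = twoway_vcov(X, resid, cluster_a, cluster_b)
```

A quick sanity check of the rule: when both cluster variables are identical, the intersection equals either of them, and the two-way estimate collapses to the one-way estimate.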

Bug: formula with more than one regressor and fixed effects

E.g.

fit = feols(fml = 'Y ~ X1 + X2 | X3', vcov = {'CRV1':'group_id'}, data = data)
# invalid value encountered in sqrt
#  np.sqrt(np.diagonal(self.vcov[x]))

Multicollinearity?

Bug for csw() syntax

Tests

compare against fixest
internal benchmarks

vcov() method does not update summary info

Example:

import pyfixest as pf
import numpy as np
from pyfixest.utils import get_data

data = get_data()
fixest = pf.Fixest(data = data)
fixest.feols("Y~X1 | csw0(X2, X3)", vcov = {'CRV1':'id'})
fixest.summary()
# ###
# 
# ---
# ###
# 
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#            Estimate  Std. Error   t value  Pr(>|t|)
# Intercept  6.648203    0.220649 30.130262   0.00000
#        X1 -0.141200    0.211081 -0.668937   0.50369
# ---
# ###
# 
# Fixed effects:  X2
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.142274    0.210556 -0.675707  0.499383
# ---
# ###
# 
# Fixed effects:  X2+X3
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.096317    0.204801 -0.470296  0.638247
fixest.vcov({'CRV3':'group_id'}).summary()
# ###
# 
# ---
# ###
# 
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#            Estimate  Std. Error   t value  Pr(>|t|)
# Intercept  6.648203    0.229614 28.953831  0.000000
#        X1 -0.141200    0.207516 -0.680428  0.502745
# ---
# ###
# 
# Fixed effects:  X2
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.142274     0.20774 -0.684867   0.49999
# ---
# ###
# 
# Fixed effects:  X2+X3
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error  t value  Pr(>|t|)
# X1 -0.096317    0.206282 -0.46692  0.644768
#

CRV3 inference

currently blocked by default -> unblock
update with code from statsmodels PR

prepare initial pypi release

oneway crv3 inference

Add requirements.txt

Currently there is no indication of what dependencies are required for the package. At the moment I'm repeatedly trying to load Fixest and installing the missing packages it notices one by one.

Implement CRV3 for arbitrary fixed effects

Should be fairly straightforward: for no clustering, run MNW's summclust algo, else do what sandwich does. E.g. do along these lines:

                if not self.has_fixef:
                    # inverse hessian precomputed?
                    tXX = np.transpose(self.X) @ self.X
                    tXy = np.transpose(self.X) @ self.Y

                    # compute leave-one-out regression coefficients (aka clusterjacks')
                    for ixg, g in enumerate(clusters):

                        Xg = self.X[np.equal(ixg, group)]
                        Yg = self.Y[np.equal(ixg, group)]
                        tXgXg = np.transpose(Xg) @ Xg
                        # jackknife regression coefficient
                        beta_jack[ixg,:] = (
                            np.linalg.pinv(tXX - tXgXg) @ (tXy - np.transpose(Xg) @ Yg)
                        ).flatten()

                else:

                    # with fixed effects, refit the model on all clusters except g
                    for ixg, g in enumerate(clusters):
                        data = self.data[~np.equal(ixg, group)]
                        model = Fixest(data)
                        model.feols(self.formula, vcov = "iid")
                        beta_jack[ixg,:] = model.beta_hat
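
For reference, a self-contained sketch of the no-fixed-effects branch that can be run on its own. It assumes plain OLS and centers the jackknife coefficients at the full-sample estimate, which is one common convention for CRV3; the function name `cluster_jackknife` and the simulated data are illustrative only.

```python
import numpy as np

def cluster_jackknife(X, y, cluster):
    """Leave-one-cluster-out OLS coefficients and a CRV3-type covariance."""
    groups = np.unique(cluster)
    G, k = len(groups), X.shape[1]
    tXX, tXy = X.T @ X, X.T @ y
    beta_hat = np.linalg.solve(tXX, tXy)
    beta_jack = np.zeros((G, k))
    for ixg, g in enumerate(groups):
        Xg, yg = X[cluster == g], y[cluster == g]
        # Subtract cluster g's contribution instead of refitting from scratch.
        beta_jack[ixg] = np.linalg.pinv(tXX - Xg.T @ Xg) @ (tXy - Xg.T @ yg)
    dev = beta_jack - beta_hat
    vcov = (G - 1) / G * dev.T @ dev
    return beta_hat, beta_jack, vcov

rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)
cluster = np.repeat(np.arange(6), 10)
beta_hat, beta_jack, vcov = cluster_jackknife(X, y, cluster)
```

Each row of `beta_jack` should agree with a direct regression on the sample that excludes the corresponding cluster, which is a convenient property to test against.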

Issues with specifying vcov

Currently, vcov is a required positional argument in feols. Note this also means you can't specify the vcov using the .vcov method without also specifying it in the feols call as a positional argument. Suggest matching the defaults from the R package: vcov = 'iid' when no fixed effects are specified, or clustered at the level of the first fixed effect when fixed effects are specified, with either default overridden by whatever is set in .vcov.
help(fixest.feols) suggests using dict("CRV1":"clustervar") to specify a cluster variable but this is improper syntax.
In my example data I'm having trouble setting any sort of clustering variable alongside a set of fixed effects (pip install causaldata; note there are no missing values in this data):

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest as pf
import numpy as np

res = restaurant_inspections.load_pandas().data
fixest = pf.Fixest(data = res)

fixest.feols('inspection_score ~ Weekend | business_name', vcov = 'iid')
# runs fine

fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

# maybe it prefers numbers?
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'Year'}))
# IndexError: index 27134 is out of bounds for axis 0 with size 27076

# Or integers?
res['random_category'] = np.random.randint(0, 10, res.shape[0])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'random_category'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076

# or categoricals?
res['categorical'] = pd.Categorical(res['random_category'])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'categorical'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076

# Works fine without the FEs
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'categorical'}))

# although still not for business_name
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''


# Maybe it can't handle single-row FEs alongside clusters?
res['counts'] = (res
		   .groupby('business_name')['business_name']
		   .transform('size'))
fixest = pf.Fixest(data = res.query('counts > 1'))
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# ValueError: matrix should have the same number of rows as fixed effect IDs.

Bug in i() method

For some reason, signs of integers are swapped, i.e. for event studies with time-to-treatment (ttt) -27, ...., 0, ..., 21, the estimated coefs are labeled as -21, ..., 0, ..., 27.

Bug in multiple estimations with fixed effects

E.g.

feols('y1 + y2 ~ x1 | species ', data = base, vcov = "hetero")

leads to

IndexError: index 1 is out of bounds for axis 1 with size 1

Handle NA values gracefully

... handle missing values - they need to be dropped

Custom fixed effects demeaning function

Write a custom demean() function that allows for NA values in either X or Y
further, allow for weights (for WLS)
implement dropping of singletons
note that there are a few errors in the current version of demean() (i.e. the handling of missings + multiple fixed effects)
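
A rough sketch of what such a function could look like, using alternating projections with optional weights. It assumes a one-dimensional variable and that missing values have already been dropped, and it does not handle singletons. The signature is hypothetical and is not the package's `demean()`.

```python
import numpy as np

def demean(x, fe, weights=None, tol=1e-10, max_iter=1000):
    """Sweep out one or more fixed effects from x by alternating projections."""
    x = np.array(x, dtype=float)  # np.array copies, so the input is untouched
    fe = np.asarray(fe)
    if fe.ndim == 1:
        fe = fe[:, None]
    w = np.ones(len(x)) if weights is None else np.asarray(weights, dtype=float)
    # Integer codes and weight totals per level, computed once per fixed effect.
    codes = [np.unique(fe[:, j], return_inverse=True)[1] for j in range(fe.shape[1])]
    wsums = [np.bincount(c, weights=w) for c in codes]
    for _ in range(max_iter):
        x_old = x.copy()
        for c, wsum in zip(codes, wsums):
            # Subtract the weighted mean of each level.
            x -= (np.bincount(c, weights=w * x) / wsum)[c]
        if np.max(np.abs(x - x_old)) < tol:
            break
    return x

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
fe = np.column_stack([rng.integers(0, 5, size=n), rng.integers(0, 4, size=n)])
w = rng.uniform(0.5, 2.0, size=n)
x_tilde = demean(x, fe, weights=w)
```

After convergence, the weighted sum of `x_tilde` within every level of every fixed effect should be numerically zero.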

Allow for mathematical transformations of variables with advanced formula syntax

E.g. currently it is not possible to run a model like Y ~ X1 + csw(log(X2), X3) (but it is possible to run e.g. log(Y) ~ X1 + csw(X2, X3)).

Enable cluster jackknife inference

Fairly straightforward:

simply add clustering variant CRV3-jackknife, e.g. as function argument {'CRV3-jackknife':'group_id'}
uncomment code here
add tests, done!

ssc() function does not work properly

i.e. adj = False does nothing

Pytest freezes for tests of interacted fixed effects via i()

Tests here
potential reason?

Documentation

Fixest
feols
...
Prepare mkdocs.

Add support for wild (cluster) bootstrap inference via wildboottest

wildboottest repo
For now, only for regressions without fixed effects.

Mimic fixest vcov defaults

drop requirement to set vcov argument in .feols()
for no specified vcov without fixed effects: iid inference
for no specified vcov with fixed effects: CRV1 clustered at the first fixed effects level

See here.
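
The default rule listed above can be sketched as a small helper. The name `default_vcov` and the string-splitting approach are assumptions of this sketch; real formula parsing (for example `csw0(f1, f2)` in the fixed effects part) would need more care.

```python
def default_vcov(fml):
    """Pick a default vcov from a fixest-style formula string (simplified)."""
    parts = [p.strip() for p in fml.split("|")]
    # The second part lists fixed effects unless it contains "~",
    # in which case it is treated here as an IV part.
    has_fixef = len(parts) > 1 and "~" not in parts[1]
    if not has_fixef:
        return "iid"
    first_fe = parts[1].split("+")[0].strip()
    return {"CRV1": first_fe}

no_fe = default_vcov("Y ~ X1")
with_fe = default_vcov("Y ~ X1 | f1 + f2")
iv_with_fe = default_vcov("Y ~ 1 | f1 | X1 ~ Z1")
```

With this rule, a call without a `vcov` argument uses iid inference when there are no fixed effects and CRV1 clustered by the first fixed effect otherwise.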

Handling of Categorical Variables

Build around formulaic's C(..., levels = ...) option.
Need to pass levels information to formula.
Rename levels to "ref" so that users can only provide a reference str instead of a full list.
Is there a need for a dedicated i() option to interact with a categorical variable? The key advantage of the i() method seems to be the custom tooling around it, e.g. iplot. But that could be handled by a plotting method with a decent regex on coefficient names (i.e. for variables included in i()). Maybe not 100% error-proof?

replace patsy dependency by formulaic

Different Design Matrix Combinations

For options, see here:

sw()
csw()
fsplit

IV support

Required Steps:

allow three-part formulas
implement IV estimation & inference. Start with just-identified models.
update summary() and tidy() methods. For tidy(), simply add index "stage1" and "stage2". Split summary() into two parts. Check out what fixest does.

Nice-to-haves (for now):

common IV diagnostics
AR confidence intervals

In Practice:

implement IV estimator (Z'X)^(-1) Z'Y.
always create X and Z. For OLS, set X = Z
pass everything through, done (at least for models with only one endog variable)
after this is implemented, allow for over identified models. This requires implementation of the 2SLS estimator + updates to inference procedures
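
A small NumPy sketch of the just-identified case described above, including the "set X = Z for OLS" idea. The function name `iv_fit` and the simulated data are assumptions made for illustration:

```python
import numpy as np

def iv_fit(X, Z, y):
    """Just-identified IV estimator (Z'X)^{-1} Z'y; OLS when Z is X."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)                 # instrument
u = rng.normal(size=n)                 # unobserved confounder
x1 = z + u + rng.normal(size=n)        # endogenous regressor
y = 1.0 - 1.0 * x1 + u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1])
Z = np.column_stack([np.ones(n), z])

beta_iv = iv_fit(X, Z, y)    # instruments x1 with z
beta_ols = iv_fit(X, X, y)   # same code path, with X used as its own instrument
```

In this simulated setup the OLS slope is biased towards zero by the confounder, while the IV slope should land close to the true value of -1. Over-identified models would need the 2SLS estimator instead, since `Z.T @ X` is no longer square.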

Improve performance

See here for faster solving of the least squares problem.
faster demeaning algo via numba / c++
clean up code, drop redundancies, etc
for a given column in the design matrix, only project out fe's when not already done (for multiple estimation)

Make sure to drop singleton fixed effects from the clustering matrix
