Dominance-Analysis : A Python Library for Accurate and Intuitive Relative Importance of Predictors

This package is designed to determine relative importance of predictors for both regression and classification models. The determination of relative importance depends on how one defines importance; Budescu (1993) and Azen and Budescu (2003) proposed using dominance analysis (DA) because it invokes a general and intuitive definition of "relative importance" that is based on the additional contribution of a predictor in all subset models. The purpose of determining predictor importance in the context of DA is not model selection but rather uncovering the individual contributions of the predictors.

When the target is a continuous variable, the package determines the dominance of one predictor over another by comparing their incremental R-squared contributions across all subset models. When the target variable is binary, it compares their incremental Pseudo R-Squared contributions across all subset models instead.

Installation

Use the following command to install the package:

pip install dominance-analysis

Important Parameters

data : Complete dataset; it should be a Pandas DataFrame.
target : Name of the target variable; it must be present in the dataset passed in.
top_k : Number of features to choose from all available features. By default, the package runs on the top 15 features.
objective : Either 0 or 1: 0 for classification and 1 for regression. By default, the package runs regression.
pseudo_r2 : One of the Pseudo R-Squared measures "mcfadden", "nagelkerke", "cox_and_snell" or "estrella"; the default is "mcfadden". It is not needed for regression (objective=1).

Dominance Analysis - The Significance!

Dominance Analysis, according to Azen and Budescu, meets three important criteria for measuring relative importance. First, the technique should be defined in terms of its ability to reduce error in predicting the outcome variable. Next, it should permit direct comparison of measures within a model (that is, X₁ is twice as important as X₂). Finally, the technique should permit inferences concerning an attribute's direct effect (that is, when considered by itself), total effect (that is, when considered with other attributes) and partial effect (that is, when considered with various combinations of other predictors). Hence, dominance analysis is both robust and intuitive, and its interpretation is straightforward.

Dominance Analysis - The Math!

Dominance Analysis is unique as it measures relative importance in a pairwise fashion, and the two predictors are compared in the context of all 2^(p−2) models that contain some subset of the other predictors. So, if we have a total of 'p' predictors, we will build 2^p-1 models (all possible subset models) and compute the incremental R² contribution of each predictor to the subset model of all other predictors. The additional contribution of a given predictor is measured by the increase in R² that results from adding that predictor to the regression model.

Let's consider a scenario with 4 predictors: X₁, X₂, X₃ and X₄. We will have to build a total of 2⁴-1 = 15 models: ⁴C₁ = 4 models with only one predictor, ⁴C₂ = 6 models with two predictors each, ⁴C₃ = 4 models with three predictors each and 1 (⁴C₄) complete model with all 4 predictors. Thus, the additional contributions of X₁ are computed as the increases in the proportion of variance accounted for when X₁ is added to each subset of the remaining predictors (i.e., the null subset {.}, {X₂}, {X₃}, {X₄}, {X₂X₃}, {X₂X₄}, {X₃X₄} and {X₂X₃X₄}). Similarly, the additional contributions of X₂ are the increases in the proportion of variance accounted for when X₂ is added to each subset of the remaining predictors (i.e., the null subset {.}, {X₁}, {X₃}, {X₄}, {X₁X₃}, {X₁X₄}, {X₃X₄} and {X₁X₃X₄}).
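The enumeration described above can be sketched in a few lines of standard-library Python. This is only an illustration of the counting argument; the helper name `subsets_without` is made up for this example and is not part of the package.

```python
from itertools import combinations

predictors = ["X1", "X2", "X3", "X4"]

# Every non-empty subset of the predictors defines one model to fit.
all_models = [
    subset
    for size in range(1, len(predictors) + 1)
    for subset in combinations(predictors, size)
]
print(len(all_models))  # 2**4 - 1 = 15


def subsets_without(target, predictors):
    """Return every subset (including the empty one) of the other predictors."""
    others = [p for p in predictors if p != target]
    return [
        subset
        for size in range(len(others) + 1)
        for subset in combinations(others, size)
    ]


# Each pair printed here is (model without X1) -> (same model with X1 added);
# the difference in R² between the two is one additional contribution of X1.
for subset in subsets_without("X1", predictors):
    print(subset, "->", subset + ("X1",))
```

The loop prints eight pairs, one for each subset of {X₂, X₃, X₄}, starting with the null subset.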

Below is the illustration of the formulas used to compute the averaged additional contributions of X₁ and X₂ within model size in the population with four predictors (We use the notation R²_{Y.X} to represent the proportion of variance in Y that is accounted for by the predictors in the model X. For example, R²_{Y.X₁,X₃} represents the proportion of variance in Y that is accounted for by the model consisting of X₁ and X₃. The additional contribution of a given predictor is measured by the increase in proportion of variance that results from adding that predictor to the regression model):
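In general form, with p predictors, the within-size average can be written as follows. This is a sketch: the symbols $C^{(k)}_{X_i}$ and $C_{X_i}$ are introduced here for illustration, and $R^2_{Y.S}$ denotes the proportion of variance in Y accounted for by the model containing the predictors in the set S (zero for the null subset).

```latex
\[
C^{(k)}_{X_i} = \binom{p-1}{k}^{-1}
\sum_{\substack{S \subseteq \{X_1,\dots,X_p\} \setminus \{X_i\} \\ |S| = k}}
\left( R^2_{Y.S \cup \{X_i\}} - R^2_{Y.S} \right),
\qquad k = 0, \dots, p-1
\]

\[
C_{X_i} = \frac{1}{p} \sum_{k=0}^{p-1} C^{(k)}_{X_i}
\]

% Example with p = 4: average contribution of X_1 to models of size k = 1
\[
C^{(1)}_{X_1} = \frac{1}{3} \left[
\left( R^2_{Y.X_1 X_2} - R^2_{Y.X_2} \right) +
\left( R^2_{Y.X_1 X_3} - R^2_{Y.X_3} \right) +
\left( R^2_{Y.X_1 X_4} - R^2_{Y.X_4} \right)
\right]
\]
```

Here $C^{(k)}_{X_i}$ averages the gain from adding $X_i$ over all models with k other predictors, and $C_{X_i}$ averages those values over all model sizes.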

The measure for proportion of variance that we have used for regression is R², but since we don't have R² in logistic regression/classification models, we have used Pseudo R² there.

The beauty of the math of Dominance Analysis is that the sum of the overall average incremental R² of all predictors is equal to the R² of the complete model (model with all predictors). Hence, the total R² can be attributed to each predictor in the model. Below is an illustration of Dominance Analysis in the Population for Hypothetical example with four predictors:

It can be seen that the Percentage Relative Importance of predictors has been computed by dividing the Overall Average Incremental R² contribution of predictors by the R² of the complete model. This explains the intuitive nature of Dominance Analysis wherein the overall R² of the model can be attributed to individual predictors within the model.
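The averaging and the additivity property described above can be checked on a small example. The sketch below uses three predictors and a table of made-up R² values (toy numbers chosen for illustration, not taken from any dataset or from the package); the function name `general_dominance` is likewise hypothetical.

```python
from itertools import combinations

# Hypothetical R² for every subset of three predictors (toy numbers only).
R2 = {
    frozenset(): 0.00,
    frozenset({"X1"}): 0.30,
    frozenset({"X2"}): 0.20,
    frozenset({"X3"}): 0.10,
    frozenset({"X1", "X2"}): 0.40,
    frozenset({"X1", "X3"}): 0.35,
    frozenset({"X2", "X3"}): 0.25,
    frozenset({"X1", "X2", "X3"}): 0.45,
}
predictors = ["X1", "X2", "X3"]


def general_dominance(target):
    """Average the incremental R² of `target` within each model size,
    then average those per-size values across all sizes."""
    others = [p for p in predictors if p != target]
    size_averages = []
    for size in range(len(others) + 1):
        gains = [
            R2[frozenset(subset) | {target}] - R2[frozenset(subset)]
            for subset in combinations(others, size)
        ]
        size_averages.append(sum(gains) / len(gains))
    return sum(size_averages) / len(size_averages)


full_r2 = R2[frozenset(predictors)]
contributions = {p: general_dominance(p) for p in predictors}
percent_importance = {p: 100 * c / full_r2 for p, c in contributions.items()}

print(contributions)
print(sum(contributions.values()), full_r2)  # the two values agree
print(percent_importance)
```

With these toy values the three overall averages add up to the R² of the complete model (0.45), so the percentages add up to 100.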

Pseudo R-Squared for Classification Task / Logistic Regression

Measures of fit in logistic regression can be classified into those based on sums of squares and those based on maximum likelihood statistics. Reviews of a variety of measures of fit proposed for logistic regression can be found in Amemiya (1981), Menard (2000), Mittlbock and Schemper (1996) and Zheng and Agresti (2000). Given the large number of proposed measures, criteria for defining appropriate R² analogues need to be determined. The following criteria, which are also found in the linear regression literature (e.g., Kvalseth, 1985; Van den Burg & Lewis, 1988), were used to select R² analogues for logistic regression:

Boundedness: The measure should vary between a minimum of zero, indicating complete lack of fit, and a maximum of one, indicating perfect fit.
Linear invariance: The measure should be invariant to nonsingular linear transformations of the variables (Ys and Xs).
Monotonicity: The measure should not decrease with the addition of a predictor.
Intuitive Interpretability: The measure of fit is intuitively interpretable, in that it agrees with the scale of the linear case for intermediate values.

Based on these criteria, the following four R² analogues were chosen that satisfied at least three of these four properties:

1. McFadden's Pseudo-R Squared

McFadden's Pseudo-R squared measure is defined as :

$\Large R_{McFadden}^{2}=1-\frac{log(L_{full})}{log(L_{null})}$

This measure satisfies all the four properties.

2. Nagelkerke Pseudo-R Squared

Nagelkerke Pseudo-R squared measure is defined as :

$\Large R_{Nagelkerke}^{2}=\frac{1-\{\frac{L_{null}}{L_{full}}\}^{2/N}}{1-L_{null}^{2/N}}$

This measure satisfies three of the four properties and doesn't satisfy the property of Interpretability.

3. Cox and Snell R-Squared

Cox and Snell Pseudo-R squared measure is defined as :

$\Large R_{Cox\&Snell}^{2}=1-\{\frac{L_{null}}{L_{full}}\}^{2/N}$

This measure satisfies three of the four properties.

4. Estrella R-Squared

Estrella Pseudo-R squared measure is defined as :

$\Large R_{Estrella}^{2}=1-\{\frac{log(L_{full})}{log(L_{null})}\}^{-\frac{2}{N}log(L_{null})}$

This measure satisfies all the four properties.

Using each of these four R² analogues, the additional contribution of a given predictor to a specific logistic model can be measured as the change (i.e., increase) in the R² analogue when the predictor is added to the model. Although all four measures give similar results, we recommend using either Estrella's (1998) model fit measure or McFadden's (1974) measure for conducting dominance analysis in logistic regression. We have a slight preference for McFadden's measure (and that is what the package computes by default) because it is computationally simpler, but both McFadden's and Estrella's measures satisfy the minimum requirements for an R² analogue.
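As a rough sketch, the four analogues can be computed directly from the log-likelihoods of the fitted and null models. The function below is an illustration, not the package's implementation; it assumes `ll_full` and `ll_null` are natural-log likelihoods (negative numbers) and `n` is the number of observations, and it takes the Estrella exponent as −(2/N)·log(L_null), following Estrella (1998), so that the measure stays between 0 and 1.

```python
import math


def pseudo_r2(ll_full, ll_null, n):
    """Return the four R² analogues from model log-likelihoods.

    ll_full: log-likelihood of the fitted model
    ll_null: log-likelihood of the intercept-only model
    n: number of observations
    """
    mcfadden = 1 - ll_full / ll_null
    # (L_null / L_full) ** (2 / n) rewritten with log-likelihoods
    cox_and_snell = 1 - math.exp((2 / n) * (ll_null - ll_full))
    # Cox and Snell rescaled by its maximum attainable value
    nagelkerke = cox_and_snell / (1 - math.exp((2 / n) * ll_null))
    estrella = 1 - (ll_full / ll_null) ** (-(2 / n) * ll_null)
    return {
        "mcfadden": mcfadden,
        "cox_and_snell": cox_and_snell,
        "nagelkerke": nagelkerke,
        "estrella": estrella,
    }


# Hypothetical log-likelihoods, chosen only to exercise the formulas.
scores = pseudo_r2(ll_full=-60.0, ll_null=-100.0, n=200)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Working with log-likelihoods rather than raw likelihoods avoids numerical underflow, since the likelihood of even a moderately sized sample is an extremely small number.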

Note: Because Dominance Analysis is computationally intensive (it builds all 2^p-1 subset models), the package lets the user choose the number of top predictors for which relative importance is computed. For regression, the top K features are selected based on F-regression; for classification, they are selected based on the Chi-Squared statistic. Dominance Analysis can be combined with Principal Component Analysis (PCA), Factor Analysis or any other feature reduction algorithm to get accurate and intuitive importance of predictors.
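A pre-selection step of this kind can be sketched with scikit-learn's `SelectKBest`. This is an assumption about how such a filter might look, not a description of the package's internals; the synthetic data and the variable names are made up for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: 8 features, of which 3 carry signal.
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

top_k = 3
# Score each feature with a univariate F-test and keep the top_k best.
selector = SelectKBest(score_func=f_regression, k=top_k).fit(X, y)
top_indices = selector.get_support(indices=True)

print(top_indices)       # column indices of the retained features
print(selector.scores_)  # F-statistic of every feature
```

For a classification target, `sklearn.feature_selection.chi2` can be passed as `score_func` instead; it requires non-negative feature values. Reducing p this way matters because the number of subset models doubles with every added predictor.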

Dominance Statistics

As described earlier, a relative importance measure should be able to describe a predictor's direct, total and partial effects. The Dominance Statistics therefore include four different types of dominance measures. Below are the definitions and interpretations of the measures:

Interactional Dominance - This is the incremental R² contribution of the predictor to the complete model. Hence, the Interactional Dominance of a particular predictor 'X' will be the difference between the R² of the complete model and the R² of the model with all other predictors except the particular predictor 'X'.
Consider a scenario where Y is the dependent variable and there are four predictors X₁, X₂, X₃ and X₄. Let R²_{Y.X₁,X₂} be the R² of the model between Y and X₁, X₂; let R²_{Y.X₁,X₃} be the R² of the model between Y and X₁, X₃; and so on. In this case, the interactional dominance of predictor X₁ will be R²_{Y.X₁,X₂,X₃,X₄} - R²_{Y.X₂,X₃,X₄}.
Hence, interactional dominance can be interpreted as the incremental impact or the dominance that a predictor has in presence of all other predictors.
Individual Dominance -
Average Partial Dominance -
Total Dominance -
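Interactional dominance as defined above needs only the complete model and the models that leave out one predictor at a time. A minimal sketch, using made-up R² values for the four-predictor scenario (the numbers and variable names are hypothetical, chosen only for illustration):

```python
# Hypothetical R² of the complete model with X1, X2, X3 and X4.
r2_complete = 0.60

# Hypothetical R² of each model that omits exactly one predictor.
r2_without = {
    "X1": 0.40,  # model with X2, X3, X4
    "X2": 0.50,  # model with X1, X3, X4
    "X3": 0.55,  # model with X1, X2, X4
    "X4": 0.58,  # model with X1, X2, X3
}

# Interactional dominance: what each predictor adds on top of all the others.
interactional_dominance = {
    name: r2_complete - r2_reduced for name, r2_reduced in r2_without.items()
}

for name, value in interactional_dominance.items():
    print(f"{name}: {value:.2f}")
```

With these toy values X₁ has the largest interactional dominance, because removing it from the complete model costs the most R².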

Complete code for the examples below is available in the example folder or in the following public kernels on Kaggle: Regression - Dominance Analysis on Boston House Price Data & Classification - Dominance Analysis on Breast Cancer Dataset

User Guide for computing Relative Importance when the response variable is Continuous

Using Boston Housing Dataset downloaded from: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

Selecting top K features and getting R² of the Complete Model

from dominance_analysis import Dominance_Datasets
from dominance_analysis import Dominance
boston_dataset=Dominance_Datasets.get_boston()
dominance_regression=Dominance(data=boston_dataset,target='House_Price',objective=1)

Incremental R-Squared

incr_variable_rsquare=dominance_regression.incremental_rsquare()

Plot Incremental R-Squared and the Dominance Curve

dominance_regression.plot_incremental_rsquare()

Dominance Statistics (R-Squared)

dominance_regression.dominance_stats()

User Guide for computing Relative Importance when the response variable is Binary

Breast Cancer Wisconsin (Diagnostic) dataset downloaded from: https://goo.gl/U2Uwz2

Selecting top K features and getting Pseudo R² of the Complete Model

from dominance_analysis import Dominance_Datasets
from dominance_analysis import Dominance
breast_cancer_data=Dominance_Datasets.get_breast_cancer()
dominance_classification=Dominance(data=breast_cancer_data,target='target',objective=0,pseudo_r2="mcfadden")

Incremental Pseudo R-Squared

incr_variable_rsquare=dominance_classification.incremental_rsquare()

Plot Incremental Pseudo R-Squared

dominance_classification.plot_incremental_rsquare()

Dominance Statistics (R-Squared)

dominance_classification.dominance_stats()

Authors & License

The Dominance Analysis package is based on the concept developed by Azen and Budescu (see references). This package is released under the MIT License. The Dominance Analysis Python package has been developed by Shashank Shekhar, Sajan Bhagat and Kunjithapatham Sivakumar. Pull requests submitted to the GitHub Repo are highly encouraged!

References

Azen, R. (2000). Inference for predictor comparisons: Dominance analysis and the distribution of R² differences. Dissertation Abstracts International B, 61/10, 5616.
Azen, R., Budescu, D. V., & Reiser, B. (2001). Criticality of predictors in multiple regression. British Journal of Mathematical and Statistical Psychology, 54, 201–225.
Azen, R., & Budescu, D. V. (2003). The Dominance Analysis Approach for Comparing Predictors in Multiple Regression. Psychological Methods, 8(2), 129–148. https://doi.org/10.1037/1082-989X.8.2.129
Azen, R., & Budescu, D. V. (2006). Comparing Predictors in Multivariate Regression Models: An Extension of Dominance Analysis. Journal of Educational and Behavioral Statistics, 31(2), 157–180. https://doi.org/10.3102/10769986031002157
Azen, R., & Traxel, N. (2009). Using Dominance Analysis to Determine Predictor Importance in Logistic Regression. Journal of Educational and Behavioral Statistics, 34(3), 319–347. https://doi.org/10.3102/1076998609332754
Budescu, D. V. (1993). Dominance analysis: A new approach to the problem of relative importance of predictors in multiple regression. Psychological Bulletin, 114(3), 542-551. https://doi.org/10.1037/0033-2909.114.3.542
Luo, W., & Azen, R. (2013). Determining Predictor Importance in Hierarchical Linear Models Using Dominance Analysis. Journal of Educational and Behavioral Statistics, 38(1), 3-31. https://doi.org/10.3102/1076998612458319
