manifoldai / merf

Mixed Effects Random Forest

License: MIT License

Jupyter Notebook 98.18% Python 1.57% Dockerfile 0.01% Makefile 0.25%

merf's Issues

saving merf as a pickle

Hi, it took me a long time to train the MERF on my dataset. After training, I tried saving it as a pickle object and reloading it to make predictions on test data. However, I got an error saying the loaded object is a dictionary that doesn't have a predict method. Any ideas on how to get around this? I don't want to retrain the model every time I pass new test data. Thanks.

AttributeError                            Traceback (most recent call last)

<ipython-input-55-0bd2bd5ab064> in <module>()
----> 1 y_hat_known_merf = loaded_model.predict(X_known, Z_known, clusters_known)
      2 y_hat_known_merf

AttributeError: 'dict' object has no attribute 'predict'
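
One possible cause, offered as an assumption because the saving code is not shown: what was written to disk was a dictionary (for example the model's `__dict__` or a results dict) rather than the fitted estimator itself. A minimal stdlib sketch with a hypothetical stand-in class `TinyModel` (not the MERF class) shows the difference:

```python
import pickle


class TinyModel:
    """Hypothetical stand-in for a fitted estimator (not the MERF class)."""

    def __init__(self, offset):
        self.offset = offset

    def predict(self, values):
        # Add the stored offset to every input value.
        return [v + self.offset for v in values]


model = TinyModel(offset=1.5)

# Pickling the object itself keeps its class, so methods survive the round trip.
restored_model = pickle.loads(pickle.dumps(model))

# Pickling only the attribute dictionary yields a plain dict on load.
restored_dict = pickle.loads(pickle.dumps(model.__dict__))

print(type(restored_model).__name__, hasattr(restored_model, "predict"))
print(type(restored_dict).__name__, hasattr(restored_dict, "predict"))
```

If the loaded object turns out to be a dict, inspecting its keys may show whether the estimator is stored inside it under some name.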

Create sphinx documentation

Issue with git cloning and pip installing from github

Thanks for an excellent package! I am very keen to try out your latest updates from the last few days (including LightGBM support and SHAP values). However, I cannot pip install directly from GitHub or clone the repository because of this error:

Clone failed
				Invalid path 'data/Rossmann Store Sales | Kaggle eval.pdf'
				unable to checkout working tree
				warning: Clone succeeded, but checkout failed.
				You can inspect what was checked out with 'git status'
				and retry with 'git restore --source=HEAD :/'

It seems like this error is because of the file name 'Rossmann Store Sales | Kaggle eval.pdf' in the data directory. Is it possible to rename the two pdf files in that directory?
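
Assuming the checkout is happening on Windows (the report does not say), the `|` character would explain the failure, since Windows does not allow it in file names. A small sketch for spotting such names; the helper `windows_invalid_chars` and the renamed file are hypothetical, and the check covers only reserved characters, not reserved names such as `CON`:

```python
# Characters that Windows does not allow in file or directory names.
WINDOWS_RESERVED_CHARS = set('<>:"/\\|?*')


def windows_invalid_chars(filename):
    """Return the sorted reserved characters found in a single file name."""
    return sorted(set(filename) & WINDOWS_RESERVED_CHARS)


# The file name from the error message above.
print(windows_invalid_chars("Rossmann Store Sales | Kaggle eval.pdf"))

# A hypothetical replacement name without reserved characters.
print(windows_invalid_chars("rossmann_store_sales_kaggle_eval.pdf"))
```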

TypeError: __init__() got an unexpected keyword argument 'n_estimators'

When I try to set arguments like n_estimators for MERF, MERF.fit raises the above initialisation error. It works fine with the default settings.
This seems to suggest that the interface between MERF and the sklearn random forest is broken. Is this a version issue?
I'm using:
Python 3.9.5
scikit-learn==0.24.2

Typechecks

Thank you for this wonderful package and your efforts!

For future users, would it be possible to add type checks and more verbose error messages to the fit function of the MERF class for cases where the input type deviates from the expected one? I had to do a bit of reverse engineering to figure out why the underlying linear algebra was failing when, for example, passing a NumPy array in place of an expected pandas Series.

(Also happy to contribute...when I have the time).

Thanks!

Can not run examples given by notebooks

Greetings,

I cannot reproduce the notebooks in ./notebooks. For example, running the code from the Real World MERF Examples notebook throws two errors.

When I tried to import the modules, the first error appeared:

No module named 'merf.evaluator'

When I tried to call mrf.fit(X_train, Z_train, clusters_train, y_train), the second error appeared:

ValueError: operands could not be broadcast together with shapes (10,) (162,)

I have attached the full traceback below for reference.
I installed merf using pip install merf, so it should be the latest version.

Any suggestions are appreciated.

ValueError Traceback (most recent call last)
in ()
13 clusters_train = train['Subject']
14 y_train = train['Reaction']
---> 15 mrf.fit(X_train, Z_train, clusters_train, y_train)
16
17 # Mixed Effects Random Forest Test

~/anaconda/lib/python3.6/site-packages/merf/merf.py in fit(self, X, Z, clusters, y)
142
143 # Compute y_star for this cluster and put back in right place
--> 144 y_star_i = y_i - Z_i.dot(b_hat_i)
145 y_star[indices_i] = y_star_i
146

~/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
713 lvalues = lvalues.values
714
--> 715 result = wrap_results(safe_na_op(lvalues, rvalues))
716 return construct_result(
717 left,

~/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
674 try:
675 with np.errstate(all='ignore'):
--> 676 return na_op(lvalues, rvalues)
677 except Exception:
678 if isinstance(rvalues, ABCSeries):

~/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
650 try:
651 result = expressions.evaluate(op, str_rep, x, y,
--> 652 raise_on_error=True, **eval_kwargs)
653 except TypeError:
654 if isinstance(y, (np.ndarray, ABCSeries, pd.Index)):

~/anaconda/lib/python3.6/site-packages/pandas/computation/expressions.py in evaluate(op, op_str, a, b, raise_on_error, use_numexpr, **eval_kwargs)
208 if use_numexpr:
209 return _evaluate(op, op_str, a, b, raise_on_error=raise_on_error,
--> 210 **eval_kwargs)
211 return _evaluate_standard(op, op_str, a, b, raise_on_error=raise_on_error)
212

~/anaconda/lib/python3.6/site-packages/pandas/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, raise_on_error, truediv, reversed, **eval_kwargs)
119
120 if result is None:
--> 121 result = _evaluate_standard(op, op_str, a, b, raise_on_error)
122
123 return result

~/anaconda/lib/python3.6/site-packages/pandas/computation/expressions.py in _evaluate_standard(op, op_str, a, b, raise_on_error, **eval_kwargs)
61 _store_test_result(False)
62 with np.errstate(all='ignore'):
---> 63 return op(a, b)
64
65

ValueError: operands could not be broadcast together with shapes (10,) (162,)

R Implementation

I just came across this algorithm on towardsdatascience.com. Is there any interest in implementing this in the R language as well?

Compute and store the MSE per iteration using oob estimates

Just looking at GLL is not good enough.

Introduction of Sample Weights

Hi, I am looking to extend the functionality to allow a sample_weight vector to be passed in... sklearn.ensemble.RandomForestRegressor requires the sample_weight as a separate vector so that it can naturally work out the variance impurity (MSE based) and node averages correctly.

I am wondering whether I could simply use y * sample_weight as y throughout the merf.fit() function and then extend the RF fit to something like rf.fit(X, y_star / sample_weight, sample_weight)? I assume this would work from what I understand of the current implementation, since it appears that the normalisation doesn't consider cluster sizes and thus there should be no need to weight here (unfortunately I don't have access to the MERF paper):

# Normalize the sums to get sigma2_hat and D_hat
sigma2_hat = (1.0 / n_obs) * sigma2_hat_sum
D_hat = (1.0 / n_clusters) * D_hat_sum

To clarify, I wouldn't expect D_hat to be normalised as above if the cluster size (or in my case, cluster weighting) were important to the overall process. If it had been important, I would have expected a D_hat per cluster (i.e., in proportion to how many samples the cluster held versus the total samples).

Add partial dependency plotting function

It would be nice if there was a function for creating partial dependency plot data or plots. This would help with translating information to consumers and clients

method predict gives values higher than 1

Here is the MERF model that gives probabilities higher than one:
zeroes_zf_m.pickle.zip

import pickle as pkl
import numpy as np
import pandas as pd

X = np.array([[0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1.]])
with open('zeroes_zf_m.pickle', 'rb') as f:
    merf_model = pkl.load(f)

print(merf_model.predict(X, Z=[[1]], clusters=pd.Series(42)), merf_model.predict(X, Z=[[1]], clusters=pd.Series('42')))

Is this correct? In my opinion, probabilities must lie in [0, 1].

Incompatibility with scikit-learn: MERF does not implement get_params

MERF does not implement get_params. This means that scikit-learn cannot clone a MERF predictor, as needed for, e.g., RandomizedSearchCV to optimize the hyperparameters of the fixed effects model.
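
For context, scikit-learn's `clone` relies on `get_params` to rebuild an unfitted copy of an estimator from its constructor arguments. The following is a sketch of that convention on a hypothetical class `MixedModelSketch`; it is not MERF's code, and the parameter names simply mirror the constructor call quoted in another issue on this page.

```python
import inspect


class MixedModelSketch:
    """Hypothetical estimator illustrating the get_params/set_params convention."""

    def __init__(self, fixed_effects_model=None, max_iterations=20):
        # Store constructor arguments unchanged under the same names.
        self.fixed_effects_model = fixed_effects_model
        self.max_iterations = max_iterations

    def get_params(self, deep=True):
        # Read parameter names from the constructor signature.
        # `deep` is accepted for API compatibility but ignored in this sketch.
        names = [
            name
            for name in inspect.signature(type(self).__init__).parameters
            if name != "self"
        ]
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        # Reject names that are not constructor parameters.
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self


original = MixedModelSketch(max_iterations=50)
# Rebuilding from get_params() is, in essence, what cloning needs.
rebuilt = type(original)(**original.get_params())
print(rebuilt.get_params())
```

A real implementation would more likely inherit from `sklearn.base.BaseEstimator`, which provides both methods as long as the constructor stores its arguments under the same names.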

How to Calculate Feature Importance

Hi,

Thank you for providing this package. I am a PhD student at NC State University. My labmate used your package to conduct her research. Does your code have any provisions to assess the importance of the fixed effect features?

Thanks,
Mehak

Add possibility to run classification

Unable to install on Python 3.7 using conda

I'm trying to install merf using conda as described here: https://anaconda.org/leylabmpi/merf, and getting the following error:

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - merf -> python[version='>=3.6,<3.7.0a0']

Your python: python=3.7

Any advice?

Create setup.py files for package installation

Implement early stopping

As in the paper.

Clearly document the dimensions of Z

Has to be array of arrays... so we can transpose and all the matrix math works out.
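
As an illustration of the shape being described, and assuming the convention mentioned in other issues on this page that a column of ones carries the random intercept, `Z` can be built as a two-dimensional array with one row per observation. The variable names and values here are made up for the example.

```python
import numpy as np

n_obs = 4
# Hypothetical covariate that should receive a random slope.
time = np.array([0.0, 1.0, 2.0, 3.0])

# Random intercept only: a single column of ones, shape (n_obs, 1).
Z_intercept = np.ones((n_obs, 1))

# Random intercept plus random slope: shape (n_obs, 2).
Z_slope = np.column_stack([np.ones(n_obs), time])

# A flat 1-D vector has no column axis; transposing it changes nothing.
Z_flat = np.ones(n_obs)

print(Z_intercept.shape, Z_slope.shape, Z_flat.shape)
print(Z_flat.T.shape == Z_flat.shape)
```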

Longitudinal Data

Dear Authors,

Thank you so much for this MERF approach. I would like to know whether I can use MERF with longitudinal data represented as follows. I believe so, since several websites say it is possible, but I did not find any examples.

The following is a representation of the data (very simplified):

The only non-longitudinal feature is the patient's name; the rest are recorded longitudinally, with the suffixes _1, _2, and _3 designating waves (timepoints) one, two, and three (with, e.g., a 1-year gap between waves). The death column is the binary class variable for predicting mortality. From what I have seen, classification is not really supported (see #11), is that right? One could round the regressor's output up or down to get a class, but I am not sure that is ideal.

The data representation:

patient_name, age_1, biomarkerX_1, smoke_1, age_2, biomarkerX_2, smoke_2, age_3, biomarkerX_3, smoke_3, death

Cheers,
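
Mixed-effects tools generally expect one row per observation plus a column identifying the cluster, so a wide layout like the one in this issue would normally be reshaped to long format first, with the patient as the cluster. A stdlib-only sketch under that assumption; the values are invented and `wide_to_long` is a hypothetical helper, not part of merf:

```python
def wide_to_long(row, id_key, constant_keys=()):
    """Split keys such as 'age_2' into one record per wave."""
    waves = {}
    for key, value in row.items():
        if key == id_key or key in constant_keys:
            continue
        # 'biomarkerX_2' -> name 'biomarkerX', wave '2'
        name, _, wave = key.rpartition("_")
        waves.setdefault(int(wave), {})[name] = value
    records = []
    for wave in sorted(waves):
        record = {id_key: row[id_key], "wave": wave}
        record.update(waves[wave])
        # Repeat patient-level columns (here the outcome) on every row.
        record.update({k: row[k] for k in constant_keys})
        records.append(record)
    return records


# Invented example values for a single patient.
wide_row = {
    "patient_name": "A",
    "age_1": 60, "biomarkerX_1": 1.2, "smoke_1": 0,
    "age_2": 61, "biomarkerX_2": 1.4, "smoke_2": 0,
    "age_3": 62, "biomarkerX_3": 1.9, "smoke_3": 1,
    "death": 1,
}

for record in wide_to_long(wide_row, "patient_name", constant_keys=("death",)):
    print(record)
```

In this layout `patient_name` would play the role of `clusters`. Whether a single patient-level outcome such as `death` suits a regression-only model is a separate question that the sketch does not address.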

Setup logging in __init__.py file

Look at simplex for example

Calculate the log(det(A)) so there is no overflow

https://blogs.sas.com/content/iml/2012/10/31/compute-the-log-determinant-of-a-matrix.html

fuck me cholesky

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linalg.cholesky.html
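
The idea behind both links can be sketched with NumPy: for a symmetric positive-definite matrix, the log-determinant equals twice the sum of the logs of the Cholesky factor's diagonal, which avoids forming a determinant that may overflow. The matrix below is made up for illustration, and this is not the code used in merf.

```python
import numpy as np

# Hypothetical symmetric positive-definite matrix, sized so det(A) exceeds float range.
n = 400
A = 10.0 * np.eye(n) + 0.01 * np.ones((n, n))

# Naive route: the determinant itself is too large to represent.
with np.errstate(over="ignore"):
    naive = np.log(np.linalg.det(A))

# Cholesky route: A = L @ L.T, so log(det(A)) = 2 * sum(log(diag(L))).
L = np.linalg.cholesky(A)
logdet_chol = 2.0 * np.sum(np.log(np.diag(L)))

# NumPy's built-in alternative returns the sign and log|det| separately.
sign, logdet_slogdet = np.linalg.slogdet(A)

print(naive, logdet_chol, logdet_slogdet)
```

`np.linalg.slogdet` also works for matrices that are not positive definite, whereas the Cholesky route raises `LinAlgError` in that case.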

will the MERF select best model based on validation set?

mrf.fit(X_train, Z_train, clusters_train, y_train, X_val, Z_val, clusters_val, y_val)
y_hat_known = mrf.predict(X_known, Z_known, clusters_known)

Is the resulting mrf model the best-performing model according to X_val, Z_val, clusters_val, y_val?

BW,
Yang

Regarding Merf Plot by Cluster

I am using Python 3.6.5 and Pandas version 1.0.1. I receive the above when attempting to plot a specific cluster by its respective id.

I noticed here in the example notebook that Panel is deprecated. I welcome feedback on the best way to address this.

Random Effect Features Importance

Thank you so much for providing this package. Does your code have any provisions to assess the importance of the random effect features?
How to display the PTEV and PREV of real data? I only found PTEV and PREV from data simulation.
Thank you,
Retno

Translation from MERF random effects nomenclature to align with other implementations

First, thank you for creating this package! It is kind of exactly what I am looking for. However, I have some questions that stem from my prior experience with mixed models on other platforms (notably the lme4 package for R), and the documentation and examples aren't helping me translate my knowledge of how random effects are specified/named in MERF.

For example, in lme4, random effects are designated as having slopes and intercepts (which I think correspond to "clusters" and "covariates" in MERF? With 1s in the covariates matrix indicating the intercepts?).

So if "subject" (in an experiment) or "county" (as in the radon example) are grouping variables over which there are multiple observations, one could specify a random intercept for subject (or county). Then, one could additionally specify a random slope for some variable by the grouping variable (for example, a random slope for experimental condition by subject, allowing the model to estimate how much subjects differ in their response to the experimental manipulation, or a random slope for floor by county, allowing counties to differ in how much floor affects radon levels in the model). I'm not quite sure if I'm translating these to the MERF nomenclature correctly.

Moreover, lme4 allows the researcher to specify multiple, crossed random effects (e.g., random intercepts for both experimental subjects and experimental items, as well as random slopes for variables by both subject and item).

Getting to the point:

My looking through the examples leads me to think that the clusters argument is the column containing the IDs for which random intercepts are generated: Is this the case?
I get the impression that the Z matrix includes 1s for the random intercepts, but can include a second column for random slopes (i.e., the covariates): Is this the case? Can there be more columns?

More generally, some comments in the notebooks or documentation about what these variables are (concretely, and what form they must/can/cannot take, and possibly relationships to how random effects are specified in other mixed modeling packages) would be very helpful.

Can MERF handle crossed random effects structures?
Finally, does MERF provide variable importance measures from the fit forest (analogous to those produced in sklearn, and from randomForest and party::cforest in R)? I couldn't find mention of that in the readme or the notebooks.

Thanks!

Submit to PyPi

MERF for classification

Hi!

I was looking for a way to add random effects to a random forest classifier for tree species identification, and gladly found your repo. From my understanding of the MERF code in merf.py, the algorithm can currently perform regression, but the EM is not designed to optimize a classification problem.

If I understood it correctly, the main obstacle is identifying the equivalent of yi - f_hat_i in the context of classification (maybe using cross entropy?). Would you be interested in adding this feature to the package?

Any thoughts on whether substituting cross.entropy(yi, f_hat_i) for yi - f_hat_i would break the math of the MERF?

Thank you so much for putting this repo together,
Sergio

Line # 96 in merf.merf.py might be better when modifying "len(indices_i)" to "sum(indices_i)"

Hello,
Thanks for your great work on merf!
When I debug merf, I found that there is one line that does not work in any case:

63   def predict(self, X: np.ndarray, Z: np.ndarray, clusters: pd.Series):
         ...
         for cluster_id in self.cluster_counts.index:
             indices_i = clusters == cluster_id

             # If cluster doesn't exist in test data that's ok. Just move on.
96           if len(indices_i) == 0:  # <-- might revise to: if sum(indices_i) == 0
                 continue

             # If cluster does exist, apply the correction.
             ...

I marked Line # 96 and suggest changing "len()" to "sum()". Otherwise, this if branch will never run, because indices_i is a boolean pd.Series with the same shape as the input clusters, so its length is never 0.
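
The point can be checked with a boolean mask. A minimal sketch using a NumPy array in place of the `pd.Series` (an assumption made to keep the example light on dependencies; the element-wise comparison behaves the same way), with invented cluster labels:

```python
import numpy as np

# Hypothetical cluster labels in the test data; cluster 7 is absent.
clusters = np.array([1, 1, 2, 3, 3])
cluster_id = 7

# Element-wise comparison gives a mask as long as the input.
indices_i = clusters == cluster_id

print(len(indices_i))   # length of the mask, never 0 for non-empty input
print(indices_i.sum())  # number of matching rows

if indices_i.sum() == 0:
    print("cluster absent from test data, skipping")
```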

And if possible, I would like to ask another question about the random effect matrix Z.

In the given example notebook, I noticed that when considering one variance as a random effect, the Z is not simply composed of one column but also has another column of ones. Why are the ones necessary?

Thanks in advance!
meng

An issue about the random state

Hi Dey,
Thank you so much for providing this useful Python package. I watched your video, which was very enlightening, and am currently using the merf package to handle my data.
I'm just wondering if it is possible to specify the random_state for merf for reproducibility. I tried to specify random_state for RandomForestRegressor, but it still returns different results every time I run the code.
Below is my code:
mrf = MERF(fixed_effects_model=RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=2), max_iterations=20)

Do you have any idea?
Thanks,
Zixiao

Is there any way to specify nested covariance structure?

Hi, first of all, thanks for making this great library.
I wonder whether it is possible to specify a nested covariance structure.
When I specify a nested group structure by passing a data frame with two columns as clusters_train, it raises an error saying that it can only accept a Series, which means only one column.

So far I could not find how I can specify it.
Thanks for reading this issue!

Add Gradient Boosted Trees as an option for fixed effects learner

Add this in addition to Random Forest as the fixed effects learner.

Why not expose all the parameters of the RandomForestRegressor?

Also, would using ExtraTreesRegressor work in this framework? It seems simple to just add a base_estimator parameter, allowing users to plug in whatever estimator they want.

I can submit a pull request if it makes sense to add these features.

Scikit-Learn compliant MERF

Hello!

First of all, thanks for developing this package and making it open source! MERF is an interesting method, and we would like to use it in one or perhaps more of our projects.

I've been playing around with my own fork of MERF to see what it would take to make MERF scikit-learn compliant. This makes it easier to use it in conjunction with other methods and procedures from sklearn and those based upon sklearn, such as pycaret. It seems to work pretty well. See the changes here

If you are interested in these changes, I could open a PR from my fork. However, this means you would have some breaking changes, and you might want to make a new major release at some point. Let me know if this is something you are interested in.

refs #64

Accounting for dataset imbalance e.g. class_weight = balanced?

Hello,

Is there any way to incorporate the sklearn parameter 'class_weight = "balanced"' into MERF, or something similar?

Thank you!

ValueError: Unable to coerce to Series, length must be 1: given 2854

Hi,
I find your approach to mixed effects modeling very appealing and wanted to try it on my data, which concerns network traffic prediction, a valuable and in-demand use case. Unfortunately, I am encountering the following error (full stack trace below). Can you please help me find the source of the problem?


ValueError                                Traceback (most recent call last)
<ipython-input-293-b5c7b2f082a6> in <module>
      1 mrf = MERF(max_iterations=500)
----> 2 mrf.fit(X_train_merf, Z_train, clusters_train, y_train)

C:\ProgramData\Anaconda3\envs\shared\lib\site-packages\merf\merf.py in fit(self, X, Z, clusters, y, X_val, Z_val, clusters_val, y_val)
    209 
    210                 # Compute y_star for this cluster and put back in right place
--> 211                 y_star_i = y_i - Z_i.dot(b_hat_i)
    212                 y_star[indices_i] = y_star_i
    213 

C:\ProgramData\Anaconda3\envs\shared\lib\site-packages\pandas\core\ops\__init__.py in f(self, other, axis, level, fill_value)
    645         # TODO: why are we passing flex=True instead of flex=not special?
    646         #  15 tests fail if we pass flex=not special instead
--> 647         self, other = _align_method_FRAME(self, other, axis, flex=True, level=level)
    648 
    649         if isinstance(other, ABCDataFrame):

C:\ProgramData\Anaconda3\envs\shared\lib\site-packages\pandas\core\ops\__init__.py in _align_method_FRAME(left, right, axis, flex, level)
    472 
    473         if right.ndim == 1:
--> 474             right = to_series(right)
    475 
    476         elif right.ndim == 2:

C:\ProgramData\Anaconda3\envs\shared\lib\site-packages\pandas\core\ops\__init__.py in to_series(right)
    463         else:
    464             if len(left.columns) != len(right):
--> 465                 raise ValueError(
    466                     msg.format(req_len=len(left.columns), given_len=len(right))
    467                 )

ValueError: Unable to coerce to Series, length must be 1: given 2854

Here is my git repo, where you can find the complete example:
https://github.com/wasifmasood/network_traffic_forecast/blob/master/MERF.ipynb