epigenelabs / pycombat Goto Github PK

pyComBat is a Python 3 implementation of ComBat, one of the most widely used tool for correcting technical biases, called batch effects, in microarray expression data.

License: GNU General Public License v3.0

Python 100.00%

pycombat's People

Contributors

Stargazers

Watchers

pycombat's Issues

Using this code more generally

Hello and thank you for this excellent package. I am trying to use ComBat to correct for batch effects in clinical data instead of gene data, and was hoping to use this code to do it. I have 10 features for each sample in each batch, with 280 samples in batch 1 and 95 samples in batch 2. Currently each batch is a dataframe with rows representing individual samples and columns representing features. What I want to do is apply combat to both of these batches so that batch effects are corrected for. Please note here that my features aren't gene expressions, but generic numerical features instead.

So, given this, I was going to implement pycombat. As pycombat assumes gene expression (rows) for each sample (columns), normalize features to 0 mean and unit variance, transpose my dataframes, , and then apply pycombat?

Thanks

Automatic detection of batch of size = 1

Add a test to detect automatically if there is a batch of size = 1. If so, use automatically mean_only=True.

typos in detailed docs

Three things I noted while following the detailed docs (as a complete newbie):
1)
df_expression = pd.concat([dataset_1,dataset_2,dataset3],join="inner",axis=1)
should be
df_expression = pd.concat([dataset_1,dataset_2,dataset_3],join="inner",axis=1)

I was able to reproduce the graphs by omitting the .transpose() function, but maybe I'm doing something wrong. I downloaded the pickle files from https://github.com/epigenelabs/pyComBat/data/

plt.boxplot(df_expression.transpose()) # had way too many plots, I think genes were the X-axis
plt.boxplot(df_expression) # more closely reproduced "distrib_raw.png"

the same correction applies to:
plt.boxplot(df_corrected.transpose())

It would be nice to show the full code for how you generated the boxplots. I know setting axis names and boxplot colours isn't the primary focus of pyCombat, but it would be really helpful for newbies :) The same sentiment applies to the nice PCA plot examples "PCA_raw.png" and "PCA_corrected.png".

I'd be happy to write up a new detailed usage example once I get my graphs nice and sorted, let's see.

Fit transform and transform

Hi,

thanks for this so interesting code. Would it be possible include a fit_tranform for a train set and a transform for a test set?

Thanks!
Pablo

Unable to install by pip

It was not able to be installed by pip with the following errors:

pip install combat
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(S
SLError("bad handshake: SysCallError(104, 'ECONNRESET')"))': /simple/combat/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')"))': /simple/combat/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')"))': /simple/combat/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')"))': /simple/combat/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')"))': /simple/combat/
Could not fetch URL https://pypi.org/simple/combat/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/combat/ (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')"))) - skipping

we can put the condition before the two last computations.

pyComBat/combat/pycombat.py

Line 222 in db89250

g_star.append(np.sum(g*LH)/np.sum(LH))

UnboundLocalError

ZeroDivisionError: division by zero

thanks for the library and your work.
I am facing a weird problem. I am using the pycombat function. However this is the error that I get repeatedly.

ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-119-bb601998f8f1> in <module>
      1 # run pyComBat
----> 2 df_corrected = pycombat(df_features_compounds.T, 
      3                         metadata["Donor"].tolist())
      4 
      5 

~/.conda/envs/monai/lib/python3.8/site-packages/combat/pycombat.py in pycombat(data, batch, mod, par_prior, prior_plots, mean_only, ref_batch, precision, **kwargs)
    655     NAs = check_NAs(dat)
    656     if not(NAs):
--> 657         B_hat, grand_mean, var_pooled = calculate_mean_var(
    658             design, batches, ref, dat, NAs, ref_batch, n_batches, n_batch, n_array)
    659         stand_mean = calculate_stand_mean(

~/.conda/envs/monai/lib/python3.8/site-packages/combat/pycombat.py in calculate_mean_var(design, batches, ref, dat, NAs, ref_batch, n_batches, n_batch, n_array)
    451     else:
    452         grand_mean = np.dot(np.transpose(
--> 453             [i / n_array for i in n_batches]), B_hat[0:n_batch])
    454     # Calculates the general variance
    455     if not NAs:  # NAs not supported

~/.conda/envs/monai/lib/python3.8/site-packages/combat/pycombat.py in <listcomp>(.0)
    451     else:
    452         grand_mean = np.dot(np.transpose(
--> 453             [i / n_array for i in n_batches]), B_hat[0:n_batch])
    454     # Calculates the general variance
    455     if not NAs:  # NAs not supported

ZeroDivisionError: division by zero

I exactly don't know what to do about it. Would you please help me?

combat.py implementation

Hi, just wanted to let you know about an existing Python implementation of combat by @brentp:

https://github.com/brentp/combat.py

which we are using already in Scanpy (https://scanpy.readthedocs.io/en/stable/api/scanpy.pp.combat.html) for batch correction in scRNA-seq data.

Any ideas about the differences between the implementations? Have you considering making a benchmark between them?

local variable 'B_hat' referenced before assignment when dataframe contains NaN

Code:

df = pd.DataFrame(np.random.rand(10, 9))
df.iloc[5, 5] = np.nan
batches = [0]*3+[1]*3+[2]*3
dft_ = pycombat(df, batches)

Causes:

dft_ = pycombat(df, batches)
File ".../miniconda3/lib/python3.8/site-packages/combat/pycombat.py", line 661, in pycombat
B_hat, grand_mean, var_pooled = calculate_mean_var(
File ".../miniconda3/lib/python3.8/site-packages/combat/pycombat.py", line 458, in calculate_mean_var
[i / n_array for i in n_batches]), B_hat[0:n_batch])
UnboundLocalError: local variable 'B_hat' referenced before assignment

No module named 'mpmath'

Code:

from combat.pycombat import pycombat

Causes:

File ".../python3.8/site-packages/combat/pycombat.py", line 30, in
import mpmath as mp
ModuleNotFoundError: No module named 'mpmath'

pycombat for bulk RNAseq dataset?

Hi, from what I understood, pyCombat have been developped and tested on microarray gene expression data, can we use it also for RNA gene expression data ? Thanks!

Degree of freedom for variance calculation

I found that np.var function is used to calculate the variances in the code. Since the np.var function is using ddof=0 by default, we need to fix it to use ddof=1 to make all the equations aligned with the original paper.

I created a PR for it. Please take a look and let me know if you have any thoughts.

PR link: #30

inquiries about ref_batch

Hello, I am a user of your great package.
I have inquiries about use of ref_batch.

I guess the ref_batch option enables us to set the reference batch that indicates data for calculating the general mean (, which makes the selected data are not modified by pycombat).
Firstly, I would like to know if my above understanding is correct or not.

The second one may be derived from incorrect use of your package, but when I give a batch list composed of 0 and 1 like [0, 0, ..., 1, 1] and indicate ref_batch=0, I got an IndexError.
On the other hand, it works well when I used a batch list composed of 1 and 2 and ref_batch=1.
I would be glad if you would have an idea about this situation.

The details are as follows:

C:\Users\{xxxx}\miniconda3\envs\pipenv\lib\site-packages\combat\pycombat.py:358: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  batches = np.asarray(batches)
Found 2 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.

IndexError: arrays used as indices must be of integer (or boolean) type
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-63-012733a94a1e> in <module>
      5 test = pd.concat([test0,test1],axis=1,join='inner')
      6 batch = [0] * test0.shape[1] + [1] * test1.shape[1]
----> 7 res = pycombat(test,batch,par_prior=True,ref_batch=0)
      8 res

~\miniconda3\envs\pipenv\lib\site-packages\combat\pycombat.py in pycombat(data, batch, mod, par_prior, prior_plots, mean_only, ref_batch, precision, **kwargs)
    659     design = treat_covariates(batchmod, mod, ref, n_batch)
    660     NAs = check_NAs(dat)
--> 661     B_hat, grand_mean, var_pooled = calculate_mean_var(
    662         design, batches, ref, dat, NAs, ref_batch, n_batches, n_batch, n_array)
    663     stand_mean = calculate_stand_mean(

~\miniconda3\envs\pipenv\lib\site-packages\combat\pycombat.py in calculate_mean_var(design, batches, ref, dat, NAs, ref_batch, n_batches, n_batch, n_array)
    460     if not NAs:  # NAs not supported
    461         if ref_batch is not None:  # depending on ref batch
--> 462             ref_dat = np.transpose(np.transpose(dat)[batches[ref]])
    463             var_pooled = np.dot(np.square(ref_dat - np.transpose(np.dot(np.transpose(
    464                 design)[batches[ref]], B_hat))), [1/n_batches[ref]]*n_batches[ref])

IndexError: arrays used as indices must be of integer (or boolean) type```

My environment is briefly Win10 and python 3.9.5.

Best,

Performance in Parallel

Thanks a lot for developing this package! The original R package (sva) is really slow in non-parametric mode, largely because of un-optimized loops and concatenation (repeatedly growing vectors in each loop) in the Monte Carlo function int.eprior. I'm really suffering from it.

However, sva supports parallel computing through BiocParallel, although the parallel computing takes place at batch level. Therefore, if you have much more CPU cores than number of batches (in my case), it won't help much. Their source code: https://github.com/jtleek/sva-devel/blob/123be9b2b9fd7c7cd495fab7d7d901767964ce9e/R/ComBat.R#L263

Does pyComBat supports parallel computing as well? I didn't find the mechanism by skimming the source codes. It will be very helpful if so.

All values in the dataframe are "NAN" after running pycombat

Hello @abdelkaderbehdenna,
After running pycombat according to the pyComBat’s documentation, I got a dataframe with all "NAN" values.
Seems like everything is ok:

Would you please help me with this?
Thank you very much and have a nice day!

epigenelabs / pycombat Goto Github PK

pycombat's People

Contributors

Stargazers

Watchers

Forkers

pycombat's Issues

Recommend Projects

Recommend Topics

Recommend Org