Giter Site home page Giter Site logo

epigenelabs / pycombat Goto Github PK

View Code? Open in Web Editor NEW
74.0 74.0 22.0 29.85 MB

pyComBat is a Python 3 implementation of ComBat, one of the most widely used tool for correcting technical biases, called batch effects, in microarray expression data.

License: GNU General Public License v3.0

Python 100.00%

pycombat's People

Contributors

abdelkaderbehdenna avatar aryoepigene avatar epigenemax avatar gianarauz avatar guillaumeap avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

pycombat's Issues

Using this code more generally

Hello and thank you for this excellent package. I am trying to use ComBat to correct for batch effects in clinical data instead of gene data, and was hoping to use this code to do it. I have 10 features for each sample in each batch, with 280 samples in batch 1 and 95 samples in batch 2. Currently each batch is a dataframe with rows representing individual samples and columns representing features. What I want to do is apply combat to both of these batches so that batch effects are corrected for. Please note here that my features aren't gene expressions, but generic numerical features instead.

So, given this, I was going to implement pycombat. As pycombat assumes gene expression (rows) for each sample (columns), normalize features to 0 mean and unit variance, transpose my dataframes, , and then apply pycombat?

Thanks

typos in detailed docs

Three things I noted while following the detailed docs (as a complete newbie):
1)
df_expression = pd.concat([dataset_1,dataset_2,dataset3],join="inner",axis=1)
should be
df_expression = pd.concat([dataset_1,dataset_2,dataset_3],join="inner",axis=1)

  1. I was able to reproduce the graphs by omitting the .transpose() function, but maybe I'm doing something wrong. I downloaded the pickle files from https://github.com/epigenelabs/pyComBat/data/
plt.boxplot(df_expression.transpose()) # had way too many plots, I think genes were the X-axis
plt.boxplot(df_expression) # more closely reproduced "distrib_raw.png"

the same correction applies to:
plt.boxplot(df_corrected.transpose())

  1. It would be nice to show the full code for how you generated the boxplots. I know setting axis names and boxplot colours isn't the primary focus of pyCombat, but it would be really helpful for newbies :) The same sentiment applies to the nice PCA plot examples "PCA_raw.png" and "PCA_corrected.png".

I'd be happy to write up a new detailed usage example once I get my graphs nice and sorted, let's see.

Fit transform and transform

Hi,

thanks for this so interesting code. Would it be possible include a fit_tranform for a train set and a transform for a test set?

Thanks!
Pablo

Unable to install by pip

It was not able to be installed by pip with the following errors:

pip install combat
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(S
SLError("bad handshake: SysCallError(104, 'ECONNRESET')"))': /simple/combat/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')"))': /simple/combat/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')"))': /simple/combat/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')"))': /simple/combat/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')"))': /simple/combat/
Could not fetch URL https://pypi.org/simple/combat/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/combat/ (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')"))) - skipping

ZeroDivisionError: division by zero

Hi

thanks for the library and your work.
I am facing a weird problem. I am using the pycombat function. However this is the error that I get repeatedly.

ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-119-bb601998f8f1> in <module>
      1 # run pyComBat
----> 2 df_corrected = pycombat(df_features_compounds.T, 
      3                         metadata["Donor"].tolist())
      4 
      5 

~/.conda/envs/monai/lib/python3.8/site-packages/combat/pycombat.py in pycombat(data, batch, mod, par_prior, prior_plots, mean_only, ref_batch, precision, **kwargs)
    655     NAs = check_NAs(dat)
    656     if not(NAs):
--> 657         B_hat, grand_mean, var_pooled = calculate_mean_var(
    658             design, batches, ref, dat, NAs, ref_batch, n_batches, n_batch, n_array)
    659         stand_mean = calculate_stand_mean(

~/.conda/envs/monai/lib/python3.8/site-packages/combat/pycombat.py in calculate_mean_var(design, batches, ref, dat, NAs, ref_batch, n_batches, n_batch, n_array)
    451     else:
    452         grand_mean = np.dot(np.transpose(
--> 453             [i / n_array for i in n_batches]), B_hat[0:n_batch])
    454     # Calculates the general variance
    455     if not NAs:  # NAs not supported

~/.conda/envs/monai/lib/python3.8/site-packages/combat/pycombat.py in <listcomp>(.0)
    451     else:
    452         grand_mean = np.dot(np.transpose(
--> 453             [i / n_array for i in n_batches]), B_hat[0:n_batch])
    454     # Calculates the general variance
    455     if not NAs:  # NAs not supported

ZeroDivisionError: division by zero

I exactly don't know what to do about it. Would you please help me?

local variable 'B_hat' referenced before assignment when dataframe contains NaN

Code:

df = pd.DataFrame(np.random.rand(10, 9))
df.iloc[5, 5] = np.nan
batches = [0]*3+[1]*3+[2]*3
dft_ = pycombat(df, batches)

Causes:

dft_ = pycombat(df, batches)
File ".../miniconda3/lib/python3.8/site-packages/combat/pycombat.py", line 661, in pycombat
B_hat, grand_mean, var_pooled = calculate_mean_var(
File ".../miniconda3/lib/python3.8/site-packages/combat/pycombat.py", line 458, in calculate_mean_var
[i / n_array for i in n_batches]), B_hat[0:n_batch])
UnboundLocalError: local variable 'B_hat' referenced before assignment

No module named 'mpmath'

Code:

from combat.pycombat import pycombat

Causes:

File ".../python3.8/site-packages/combat/pycombat.py", line 30, in
import mpmath as mp
ModuleNotFoundError: No module named 'mpmath'

pycombat for bulk RNAseq dataset?

Hi, from what I understood, pyCombat have been developped and tested on microarray gene expression data, can we use it also for RNA gene expression data ? Thanks!

Degree of freedom for variance calculation

I found that np.var function is used to calculate the variances in the code. Since the np.var function is using ddof=0 by default, we need to fix it to use ddof=1 to make all the equations aligned with the original paper.

I created a PR for it. Please take a look and let me know if you have any thoughts.

PR link: #30

inquiries about ref_batch

Hello, I am a user of your great package.
I have inquiries about use of ref_batch.

I guess the ref_batch option enables us to set the reference batch that indicates data for calculating the general mean (, which makes the selected data are not modified by pycombat).
Firstly, I would like to know if my above understanding is correct or not.

The second one may be derived from incorrect use of your package, but when I give a batch list composed of 0 and 1 like [0, 0, ..., 1, 1] and indicate ref_batch=0, I got an IndexError.
On the other hand, it works well when I used a batch list composed of 1 and 2 and ref_batch=1.
I would be glad if you would have an idea about this situation.

The details are as follows:

C:\Users\{xxxx}\miniconda3\envs\pipenv\lib\site-packages\combat\pycombat.py:358: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  batches = np.asarray(batches)
Found 2 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.

IndexError: arrays used as indices must be of integer (or boolean) type
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-63-012733a94a1e> in <module>
      5 test = pd.concat([test0,test1],axis=1,join='inner')
      6 batch = [0] * test0.shape[1] + [1] * test1.shape[1]
----> 7 res = pycombat(test,batch,par_prior=True,ref_batch=0)
      8 res

~\miniconda3\envs\pipenv\lib\site-packages\combat\pycombat.py in pycombat(data, batch, mod, par_prior, prior_plots, mean_only, ref_batch, precision, **kwargs)
    659     design = treat_covariates(batchmod, mod, ref, n_batch)
    660     NAs = check_NAs(dat)
--> 661     B_hat, grand_mean, var_pooled = calculate_mean_var(
    662         design, batches, ref, dat, NAs, ref_batch, n_batches, n_batch, n_array)
    663     stand_mean = calculate_stand_mean(

~\miniconda3\envs\pipenv\lib\site-packages\combat\pycombat.py in calculate_mean_var(design, batches, ref, dat, NAs, ref_batch, n_batches, n_batch, n_array)
    460     if not NAs:  # NAs not supported
    461         if ref_batch is not None:  # depending on ref batch
--> 462             ref_dat = np.transpose(np.transpose(dat)[batches[ref]])
    463             var_pooled = np.dot(np.square(ref_dat - np.transpose(np.dot(np.transpose(
    464                 design)[batches[ref]], B_hat))), [1/n_batches[ref]]*n_batches[ref])

IndexError: arrays used as indices must be of integer (or boolean) type```

My environment is briefly Win10 and python 3.9.5.

Best,

Performance in Parallel

Thanks a lot for developing this package! The original R package (sva) is really slow in non-parametric mode, largely because of un-optimized loops and concatenation (repeatedly growing vectors in each loop) in the Monte Carlo function int.eprior. I'm really suffering from it.

However, sva supports parallel computing through BiocParallel, although the parallel computing takes place at batch level. Therefore, if you have much more CPU cores than number of batches (in my case), it won't help much. Their source code: https://github.com/jtleek/sva-devel/blob/123be9b2b9fd7c7cd495fab7d7d901767964ce9e/R/ComBat.R#L263

Does pyComBat supports parallel computing as well? I didn't find the mechanism by skimming the source codes. It will be very helpful if so.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.