rna-features

rna-features is a package used to generate machine-learning features from RNAseq data. Given a list of dataset directories containing DESeq2 contrast files (.csv) and a 'tpm.tsv' matrix of gene Transcripts per Million (TPM) across samples (generated by the llrnaseq pipeline), it generates a feature matrix containing the following features per dataset:

  • Gene breadth (p <= p-value)
    • down (log2FC <= -1)
    • neither (-1 < log2FC < 1)
    • up (log2FC >= 1)
  • log2FC (p <= p-value)
    • Median Absolute Deviation (MAD)
    • Maximum
    • Median
  • TPM
    • MAD
    • Maximum
    • Median
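The feature definitions above can be sketched for a single gene as follows. This is a minimal illustration using only the standard library, not the package's actual implementation: the function name `gene_features`, its input layout, and the unscaled median absolute deviation are all assumptions made for the example, and the package may define or scale MAD differently.

```python
from statistics import median


def gene_features(contrasts, tpm_values, p_cutoff=0.05):
    """Summarise one gene (hypothetical helper, not part of rna-features).

    contrasts:  list of (log2FC, p-value) pairs, one per contrast file
    tpm_values: list of TPM values for the gene, one per sample
    """
    # Keep only log2FC values whose p-value passes the cutoff.
    lfc = [fc for fc, p in contrasts if p <= p_cutoff]

    def mad(values):
        # Unscaled median absolute deviation: median of |x - median(x)|.
        centre = median(values)
        return median(abs(v - centre) for v in values)

    def summary(values):
        # No values left after filtering: nothing to summarise.
        if not values:
            return {"mad": None, "max": None, "median": None}
        return {"mad": mad(values), "max": max(values), "median": median(values)}

    return {
        # Gene breadth: number of contrasts in each regulation class.
        "regulation": {
            "down": sum(fc <= -1 for fc in lfc),
            "neither": sum(-1 < fc < 1 for fc in lfc),
            "up": sum(fc >= 1 for fc in lfc),
        },
        "log2foldchange": summary(lfc),
        "tpm": summary(tpm_values),
    }


# Toy input: the last contrast is dropped because its p-value exceeds 0.05.
example = gene_features(
    contrasts=[(1.5, 0.01), (-2.0, 0.001), (0.3, 0.04), (3.0, 0.2)],
    tpm_values=[1.0, 2.0, 4.0],
)
print(example)
```

Here `max` is the signed maximum, which matches the output preview, where a gene with only down-regulated contrasts has a negative maximum.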

These features are output as a feature_matrix file in both .csv and .pkl format (the .pkl file can be loaded as a pandas dataframe with pandas.read_pickle(path)). Below is an output preview:

                         regulation              log2foldchange                              tpm                        
                               down neither   up            mad        max     median        mad         max      median
dataset gene                                                                                                            
set_1   Solyc00g500063.1        0.0     1.0  0.0       0.000000   0.953245   0.953245   8.412766   54.887642   27.721765
        Solyc00g500185.1        0.0     0.0  1.0       0.000000   1.333732   1.333732   0.135050    0.943789    0.254913
        Solyc01g005000.3        0.0     1.0  2.0       0.118566   1.097196   1.093001  44.024541  254.986816  108.668376
        Solyc01g005010.4        4.0     0.0  0.0       0.439194  -1.201843  -1.577684  13.191743   85.372719   12.014153
        Solyc01g005020.3        0.0     1.0  0.0       0.000000   0.649139   0.649139   6.994529   42.430080   18.944556
...                             ...     ...  ...            ...        ...        ...        ...         ...         ...
set_2   Solyc12g150103.1        0.0     2.0  3.0       0.245354   1.598051   1.049794   1.223475    7.559584    3.616534
        Solyc12g150108.1        1.0     0.0  0.0       0.000000 -23.707473 -23.707473   1.287612   13.105947    0.000000
        Solyc12g150113.1        0.0     1.0  4.0       0.251563   1.845714   1.397828  40.746832  193.108032   59.591179
        Solyc12g150124.1        0.0     0.0  2.0       0.076378   1.622478   1.546100   0.468325    4.811159    0.703217
        Solyc12g150132.1        0.0     0.0  1.0       0.000000   4.130969   4.130969   0.074633    0.551118    0.091994
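
As a sketch of working with this output in pandas, the snippet below reads a feature matrix CSV back into a DataFrame and selects parts of it. It assumes the CSV has two header rows and a (dataset, gene) index, as the preview suggests. The helper `load_feature_matrix` is hypothetical, and the demo frame is a small toy table written to an in-memory buffer rather than a real output file.

```python
import io

import pandas as pd


def load_feature_matrix(source):
    """Read a feature matrix CSV with two header rows and a two-level index."""
    return pd.read_csv(source, header=[0, 1], index_col=[0, 1])


# Build a toy matrix with the same column/index layout as the preview.
columns = pd.MultiIndex.from_tuples([
    ("regulation", "down"), ("regulation", "neither"), ("regulation", "up"),
    ("tpm", "median"),
])
index = pd.MultiIndex.from_tuples(
    [("set_1", "Solyc00g500063.1"), ("set_1", "Solyc00g500185.1")],
    names=["dataset", "gene"],
)
demo = pd.DataFrame(
    [[0.0, 1.0, 0.0, 27.721765], [0.0, 0.0, 1.0, 0.254913]],
    index=index,
    columns=columns,
)

# Round-trip through CSV text to stand in for a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)
features = load_feature_matrix(buffer)

# Select one feature column across all datasets and genes.
tpm_median = features[("tpm", "median")]

# Select the regulation counts for a single dataset.
set_1_regulation = features.xs("set_1", level="dataset")["regulation"]
print(set_1_regulation)
```

For a real run, a path such as `feature_matrix.csv` would be passed to `load_feature_matrix` instead of the buffer.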

Installation

To install rna-features, download the latest .whl binary from the releases page and install it using pip (note: the package is not currently installable with Python 3.10, as dependencies such as numpy have not yet released compatible wheels):

wget https://github.com/SpikyClip/rna-features/releases/download/0.1.1-dev/rna_features-0.1.1-py3-none-any.whl

pip install rna_features-0.1.1-py3-none-any.whl

This will install rna-features as a Python package, and the rna-features command will be available on $PATH. To test whether the installation succeeded:

rna-features -h

The following help message should appear:

usage: rna-features [-h] [-p p-value] dir [dir ...]

Generates machine-learning features from RNAseq data. Takes a list of 
directories containing DESeq2 contrast files (.csv) and a 'tpm.tsv' file 
(containing a matrix of tpm values of genes against sample) returning a 
'feature_matrix.csv' containing gene expression breadth and log2fc/tpm 
mad, max and median for each gene.

positional arguments:
  dir         Dataset directories containing DESeq2 contrast files (.csv) and a 'tpm.tsv' matrix file.

optional arguments:
  -h, --help  show this help message and exit
  -p p-value  p-value cutoff for filtering log2fc values [default: 0.05]

Usage

To use rna-features, specify a list of directories each containing DESeq2 .csv contrast files and one tpm.tsv file:

rna-features dataset_1 dataset_2 dataset_3

An optional p-value cutoff can be specified:

rna-features -p 0.005 dataset_1 dataset_2 dataset_3
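
Before running the command, it can help to check that each directory has the expected layout. The following is a minimal sketch, assuming only what is stated above (at least one .csv contrast file and one tpm.tsv per directory). The helper `check_dataset_dir` is hypothetical and not part of rna-features, and the demo uses temporary directories with empty placeholder files.

```python
import tempfile
from pathlib import Path


def check_dataset_dir(path):
    """Return a list of problems found in one dataset directory (empty if none)."""
    directory = Path(path)
    if not directory.is_dir():
        return [f"{directory} is not a directory"]
    problems = []
    # At least one DESeq2 contrast file is expected.
    if not any(directory.glob("*.csv")):
        problems.append(f"{directory} contains no .csv contrast files")
    # Exactly one TPM matrix with this fixed name is expected.
    if not (directory / "tpm.tsv").is_file():
        problems.append(f"{directory} has no tpm.tsv file")
    return problems


# Demo with throwaway directories: dataset_1 is complete, dataset_2 is empty.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "dataset_1")
    good.mkdir()
    (good / "contrast_a.csv").touch()
    (good / "tpm.tsv").touch()

    bad = Path(tmp, "dataset_2")
    bad.mkdir()

    good_problems = check_dataset_dir(good)
    bad_problems = check_dataset_dir(bad)

print(good_problems)
print(bad_problems)
```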

Additional Notes

  • The contrast files (*.csv) should be in the following format:
                    "",      "baseMean", "log2FoldChange",          "lfcSE",           "stat",            "pvalue",              "padj"
    "Solyc01g005000.3",4496.05232181299, 1.09719580776875,0.313072912511878, 3.50460152865228,0.000457291165260712, 0.0115280270712814
    "Solyc01g005340.3",540.376944106274, 0.52013987940027,0.170624565359894, 3.04844661906186,  0.0023002777636722, 0.0362570019128406
    "Solyc01g005390.3",16.4785747787331,-1.85885261292963,0.471053842373692,-3.94615741496274,7.94154133931579e-05,0.00287540425470711
    "Solyc01g005410.4",1181.71130130374, 1.37296624988023,0.394738835793252, 3.47816359928501,0.000504861691439399, 0.0125485785916511
    
  • The tpm matrix (tpm.tsv) should be in the following tab-delimited format:
    gene_id	01-0-hr-C1	02-0-hr-C2	03-0-hr-C3	04-0-hr-JA1
    Solyc00g500003.1	0.030844	0.011062	0.006824
    Solyc00g500041.1	1.515571	1.78357	1.503047
    Solyc00g500042.1	0.258916	0.273953	0.248473
    
  • NaN values may occur in the regulation and log2foldchange columns if the tpm.tsv matrix contains a broader set of genes than those found in the contrast files. Such NaN values must be handled by the user.
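
The two input formats described in these notes can be read with the standard library, which also makes it easy to see which genes would end up with NaN values. This is a sketch under stated assumptions, not the package's own parser: the helpers `read_contrast` and `read_tpm` are hypothetical, the TPM values for Solyc01g005000.3 are invented for the demo, and "NA" entries (which DESeq2 can emit) are mapped to NaN.

```python
import csv
import io
import math


def to_float(text):
    """Convert a field to float, mapping DESeq2-style 'NA' to NaN."""
    text = text.strip()
    return math.nan if text == "NA" else float(text)


def read_contrast(handle):
    """Return {gene_id: (log2FoldChange, pvalue)} from a contrast CSV."""
    reader = csv.reader(handle, skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    lfc_col = header.index("log2FoldChange")
    p_col = header.index("pvalue")
    return {
        row[0].strip(): (to_float(row[lfc_col]), to_float(row[p_col]))
        for row in reader
        if row
    }


def read_tpm(handle):
    """Return {gene_id: [tpm, ...]} from a tab-delimited TPM matrix."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row of sample names
    return {row[0]: [to_float(v) for v in row[1:]] for row in reader if row}


# Toy inputs in the formats shown above.
contrast_text = (
    '"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"\n'
    '"Solyc01g005000.3",4496.05232181299,1.09719580776875,0.313072912511878,'
    "3.50460152865228,0.000457291165260712,0.0115280270712814\n"
    '"Solyc01g005390.3",16.4785747787331,-1.85885261292963,0.471053842373692,'
    "-3.94615741496274,7.94154133931579e-05,0.00287540425470711\n"
)
tpm_text = (
    "gene_id\t01-0-hr-C1\t02-0-hr-C2\t03-0-hr-C3\n"
    "Solyc00g500003.1\t0.030844\t0.011062\t0.006824\n"
    "Solyc01g005000.3\t1.0\t2.0\t3.0\n"  # invented values for illustration
)

contrasts = read_contrast(io.StringIO(contrast_text))
tpm = read_tpm(io.StringIO(tpm_text))

# Genes present in the TPM matrix but absent from the contrast file are the
# ones expected to have NaN in the regulation and log2foldchange columns.
missing = sorted(set(tpm) - set(contrasts))
print(missing)
```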
