rna-features is a package used to generate machine-learning features from
RNAseq data. Given a list of dataset directories containing DESeq2 contrast
files (.csv) and a 'tpm.tsv' matrix of gene Transcripts per Million (TPM)
across samples (generated by the llrnaseq pipeline), it generates a
feature matrix containing the following features per dataset:
- Gene breadth (p <= p-value)
  - down (log2FC <= -1)
  - neither (-1 < log2FC < 1)
  - up (log2FC >= 1)
- log2FC (p <= p-value)
  - Median Absolute Deviation (MAD)
  - Maximum
  - Median
- TPM
  - MAD
  - Maximum
  - Median
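The thresholds in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: the helper name `summarise_log2fc`, the `(log2fc, p)` input pairs and the sample numbers are all hypothetical, and whether the cutoff applies to the `pvalue` or the `padj` column is left open. MAD is omitted because the exact estimator the package uses is not stated here.

```python
from statistics import median


def summarise_log2fc(contrasts, p_cutoff=0.05):
    """Summarise one gene across contrasts (hypothetical helper).

    contrasts: iterable of (log2fc, p) pairs, one pair per contrast file.
    """
    # Keep only contrasts that pass the p-value cutoff (p <= cutoff).
    significant = [lfc for lfc, p in contrasts if p <= p_cutoff]

    # Gene breadth: count significant contrasts in each regulation class.
    breadth = {
        "down": sum(1 for lfc in significant if lfc <= -1),
        "neither": sum(1 for lfc in significant if -1 < lfc < 1),
        "up": sum(1 for lfc in significant if lfc >= 1),
    }

    # Without any significant contrast there is nothing to summarise.
    if not significant:
        return {"breadth": breadth, "max": None, "median": None}
    return {
        "breadth": breadth,
        "max": max(significant),
        "median": median(significant),
    }


# Illustrative values only: four contrasts for a single gene.
example = [(1.2, 0.001), (-1.5, 0.01), (0.4, 0.03), (2.0, 0.2)]
summary = summarise_log2fc(example)
print(summary)
```

In this example the last pair (p = 0.2) fails the default cutoff of 0.05, so only three contrasts are counted, one in each regulation class.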
These features are output as a feature_matrix file in both .csv and .pkl
format (the .pkl file can be loaded as a pandas dataframe with
pandas.read_pickle(path)). Below is an output preview:
                          regulation               log2foldchange                     tpm
                          down  neither   up       mad         max      median        mad         max      median
dataset gene
set_1   Solyc00g500063.1   0.0      1.0  0.0  0.000000    0.953245    0.953245   8.412766   54.887642   27.721765
        Solyc00g500185.1   0.0      0.0  1.0  0.000000    1.333732    1.333732   0.135050    0.943789    0.254913
        Solyc01g005000.3   0.0      1.0  2.0  0.118566    1.097196    1.093001  44.024541  254.986816  108.668376
        Solyc01g005010.4   4.0      0.0  0.0  0.439194   -1.201843   -1.577684  13.191743   85.372719   12.014153
        Solyc01g005020.3   0.0      1.0  0.0  0.000000    0.649139    0.649139   6.994529   42.430080   18.944556
        ...                ...      ...  ...       ...         ...         ...        ...         ...         ...
set_2   Solyc12g150103.1   0.0      2.0  3.0  0.245354    1.598051    1.049794   1.223475    7.559584    3.616534
        Solyc12g150108.1   1.0      0.0  0.0  0.000000  -23.707473  -23.707473   1.287612   13.105947    0.000000
        Solyc12g150113.1   0.0      1.0  4.0  0.251563    1.845714    1.397828  40.746832  193.108032   59.591179
        Solyc12g150124.1   0.0      0.0  2.0  0.076378    1.622478    1.546100   0.468325    4.811159    0.703217
        Solyc12g150132.1   0.0      0.0  1.0  0.000000    4.130969    4.130969   0.074633    0.551118    0.091994
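The preview suggests a two-level column header and a (dataset, gene) row index. The sketch below assumes exactly that layout, inferred from the preview rather than from the package source. It rebuilds three of the preview rows by hand, round-trips them through a temporary pickle to stand in for the real feature_matrix.pkl, and shows a few common selections.

```python
import os
import tempfile

import pandas as pd

# Column and index layout assumed from the output preview.
columns = pd.MultiIndex.from_tuples(
    [
        ("regulation", "down"), ("regulation", "neither"), ("regulation", "up"),
        ("log2foldchange", "mad"), ("log2foldchange", "max"), ("log2foldchange", "median"),
        ("tpm", "mad"), ("tpm", "max"), ("tpm", "median"),
    ]
)
index = pd.MultiIndex.from_tuples(
    [
        ("set_1", "Solyc00g500063.1"),
        ("set_1", "Solyc01g005000.3"),
        ("set_2", "Solyc12g150108.1"),
    ],
    names=["dataset", "gene"],
)
# Three rows copied from the preview.
rows = [
    [0.0, 1.0, 0.0, 0.000000, 0.953245, 0.953245, 8.412766, 54.887642, 27.721765],
    [0.0, 1.0, 2.0, 0.118566, 1.097196, 1.093001, 44.024541, 254.986816, 108.668376],
    [1.0, 0.0, 0.0, 0.000000, -23.707473, -23.707473, 1.287612, 13.105947, 0.000000],
]
frame = pd.DataFrame(rows, index=index, columns=columns)

# Stand-in for the real output file: write a pickle, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature_matrix.pkl")
    frame.to_pickle(path)
    features = pd.read_pickle(path)

# All TPM statistics (sub-frame with columns mad, max, median).
tpm = features["tpm"]

# All genes of one dataset (the dataset level is dropped from the index).
set_1 = features.loc["set_1"]

# Genes up-regulated in at least two contrasts.
up_counts = features[("regulation", "up")]
up_genes = features[up_counts >= 2].index.get_level_values("gene").tolist()
print(up_genes)
```

With a real run, only the `pd.read_pickle` call is needed, pointed at the generated .pkl file.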
To install rna-features, download the latest .whl binary from the
releases page and install it using pip (note: the package is not currently
installable with Python 3.10, as dependencies such as numpy have not yet
released compatible wheels):
wget https://github.com/SpikyClip/rna-features/releases/download/0.1.1-dev/rna_features-0.1.1-py3-none-any.whl
pip install rna_features-0.1.1-py3-none-any.whl
This installs rna-features as a Python package and makes the rna-features
command available on $PATH. To test whether the installation succeeded:
rna-features -h
The following help message should appear:
usage: rna-features [-h] [-p p-value] dir [dir ...]
Generates machine-learning features from RNAseq data. Takes a list of
directories containing DESeq2 contrast files (.csv) and a 'tpm.tsv' file
(containing a matrix of tpm values of genes against sample) returning a
'feature_matrix.csv' containing gene expression breadth and log2fc/tpm
mad, max and median for each gene.
positional arguments:
dir Dataset directories containing DESeq2 contrast files (.csv) and a 'tpm.tsv' matrix file.
optional arguments:
-h, --help show this help message and exit
-p p-value p-value cutoff for filtering log2fc values [default: 0.05]
To use rna-features, specify a list of directories, each containing DESeq2
.csv contrast files and one tpm.tsv file:
rna-features dataset_1 dataset_2 dataset_3
An optional p-value cutoff can be specified:
rna-features -p 0.005 dataset_1 dataset_2 dataset_3
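When the tool is called from a script, for example to try several cutoffs, the same command lines can be assembled in Python. The sketch below is one possible wrapper, not part of the package: `build_command` and `run_rna_features` are hypothetical helper names, and only the `-p` option and positional directories shown in the help message are used.

```python
import subprocess


def build_command(dataset_dirs, p_value=None):
    """Assemble the rna-features command line (hypothetical helper)."""
    if not dataset_dirs:
        raise ValueError("at least one dataset directory is required")
    command = ["rna-features"]
    # The optional cutoff goes before the positional directories.
    if p_value is not None:
        command += ["-p", str(p_value)]
    command += [str(directory) for directory in dataset_dirs]
    return command


def run_rna_features(dataset_dirs, p_value=None):
    """Run the tool; requires rna-features to be installed and on $PATH."""
    # check=True raises CalledProcessError if the command exits non-zero.
    return subprocess.run(build_command(dataset_dirs, p_value), check=True)


# Only build and show the command here; nothing is executed.
print(build_command(["dataset_1", "dataset_2", "dataset_3"], p_value=0.005))
```

Calling `run_rna_features(["dataset_1", "dataset_2"], p_value=0.005)` would then execute the equivalent of the second command above.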
- The contrast files (*.csv) should be in the following format:

    "","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"
    "Solyc01g005000.3",4496.05232181299,1.09719580776875,0.313072912511878,3.50460152865228,0.000457291165260712,0.0115280270712814
    "Solyc01g005340.3",540.376944106274,0.52013987940027,0.170624565359894,3.04844661906186,0.0023002777636722,0.0362570019128406
    "Solyc01g005390.3",16.4785747787331,-1.85885261292963,0.471053842373692,-3.94615741496274,7.94154133931579e-05,0.00287540425470711
    "Solyc01g005410.4",1181.71130130374,1.37296624988023,0.394738835793252,3.47816359928501,0.000504861691439399,0.0125485785916511

- The tpm matrix (tpm.tsv) should be in the following tab-delimited format:

    gene_id           01-0-hr-C1  02-0-hr-C2  03-0-hr-C3  04-0-hr-JA1
    Solyc00g500003.1  0.030844    0.011062    0.006824
    Solyc00g500041.1  1.515571    1.78357     1.503047
    Solyc00g500042.1  0.258916    0.273953    0.248473

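To check that input files match these two layouts before a run, they can be parsed with the standard library alone. The following is a rough sketch under stated assumptions: `read_contrast` and `read_tpm` are hypothetical helpers, the cutoff is applied to the `pvalue` column (the package may use a different column), and the TPM sample keeps only the first three sample columns so that header and rows line up.

```python
import csv
import io
from statistics import median

# Contrast sample taken from the format description above.
CONTRAST_CSV = (
    '"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"\n'
    '"Solyc01g005000.3",4496.05232181299,1.09719580776875,0.313072912511878,'
    '3.50460152865228,0.000457291165260712,0.0115280270712814\n'
    '"Solyc01g005340.3",540.376944106274,0.52013987940027,0.170624565359894,'
    '3.04844661906186,0.0023002777636722,0.0362570019128406\n'
    '"Solyc01g005390.3",16.4785747787331,-1.85885261292963,0.471053842373692,'
    '-3.94615741496274,7.94154133931579e-05,0.00287540425470711\n'
)

# TPM sample, joined with real tab characters.
TPM_TSV = "\n".join(
    "\t".join(fields)
    for fields in [
        ["gene_id", "01-0-hr-C1", "02-0-hr-C2", "03-0-hr-C3"],
        ["Solyc00g500003.1", "0.030844", "0.011062", "0.006824"],
        ["Solyc00g500041.1", "1.515571", "1.78357", "1.503047"],
    ]
) + "\n"


def read_contrast(handle, p_cutoff=0.05, p_column="pvalue"):
    """Return {gene: log2FoldChange} for rows with p <= p_cutoff."""
    reader = csv.DictReader(handle)
    gene_column = reader.fieldnames[0]  # the unnamed first column ("")
    kept = {}
    for row in reader:
        p_text = row[p_column]
        if p_text in ("", "NA"):  # skip rows without a usable p-value
            continue
        if float(p_text) <= p_cutoff:
            kept[row[gene_column]] = float(row["log2FoldChange"])
    return kept


def read_tpm(handle):
    """Return (sample names, {gene: [TPM per sample]})."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]
    tpm = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return samples, tpm


contrast = read_contrast(io.StringIO(CONTRAST_CSV), p_cutoff=0.001)
samples, tpm = read_tpm(io.StringIO(TPM_TSV))
print(sorted(contrast))
print(max(tpm["Solyc00g500041.1"]), median(tpm["Solyc00g500041.1"]))
```

For files on disk, replace the `io.StringIO` objects with `open(path, newline="")`.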
NaN values may occur in the regulation and log2foldchange columns if the
tpm.tsv matrix contains a broader set of genes than those found in the
contrast files. Such NaN values have to be handled by the user.
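How to treat these NaN values depends on the downstream model. As one possible approach, the sketch below either drops genes without contrast data or fills the regulation counts with zero. It assumes the column layout shown in the output preview; the second gene and its TPM values are made up for illustration, and treating a missing count as zero is an interpretation, not something the package prescribes.

```python
import pandas as pd

nan = float("nan")
columns = pd.MultiIndex.from_tuples(
    [
        ("regulation", "down"), ("regulation", "neither"), ("regulation", "up"),
        ("log2foldchange", "mad"), ("log2foldchange", "max"), ("log2foldchange", "median"),
        ("tpm", "mad"), ("tpm", "max"), ("tpm", "median"),
    ]
)
index = pd.MultiIndex.from_tuples(
    [("set_1", "Solyc00g500063.1"), ("set_1", "gene_only_in_tpm")],
    names=["dataset", "gene"],
)
features = pd.DataFrame(
    [
        # Row copied from the output preview.
        [0.0, 1.0, 0.0, 0.0, 0.953245, 0.953245, 8.412766, 54.887642, 27.721765],
        # Hypothetical gene present in tpm.tsv but in no contrast file.
        [nan, nan, nan, nan, nan, nan, 0.5, 2.0, 1.0],
    ],
    index=index,
    columns=columns,
)

# Option 1: keep only genes that have contrast data.
missing = features["regulation"].isna().any(axis=1)
complete = features[~missing]

# Option 2: treat missing breadth counts as zero and leave the
# log2foldchange statistics as NaN.
filled = features.copy()
for name in ("down", "neither", "up"):
    filled[("regulation", name)] = filled[("regulation", name)].fillna(0)

print(complete.index.get_level_values("gene").tolist())
print(filled["regulation"])
```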