bioinfokit's Introduction

The bioinfokit toolkit aims to provide easy-to-use functions to analyze,
visualize, and interpret biological data generated from genome-scale omics experiments.

How to install:

bioinfokit requires

  • Python 3
  • NumPy
  • scikit-learn
  • seaborn
  • pandas
  • matplotlib
  • SciPy
  • matplotlib_venn

bioinfokit can be installed using pip, easy_install, conda, or git.

Install using pip for Python 3 (easiest way)

# install
pip install bioinfokit

# upgrade to latest version
pip install bioinfokit --upgrade

# uninstall 
pip uninstall bioinfokit

Install using easy_install for Python 3

# install latest version
easy_install bioinfokit

# specific version
easy_install bioinfokit==0.3

# uninstall 
pip uninstall bioinfokit

Install using conda

conda install -c bioconda bioinfokit

Install using git

# download and install bioinfokit (Tested on Linux, Mac, Windows) 
git clone https://github.com/reneshbedre/bioinfokit.git
cd bioinfokit
python setup.py install

Check the version of bioinfokit

>>> import bioinfokit
>>> bioinfokit.__version__
'0.4'

How to cite bioinfokit?

  • Renesh Bedre. (2020, March 5). reneshbedre/bioinfokit: Bioinformatics data analysis and visualization toolkit. Zenodo. http://doi.org/10.5281/zenodo.3698145.
  • Additionally, check Zenodo to cite a specific version of bioinfokit

Support

If you enjoy bioinfokit, consider supporting the author via Buy Me A Coffee.

Getting Started

Gene expression analysis

Volcano plot

latest update v2.0.8

bioinfokit.visuz.GeneExpression.volcano(df, lfc, pv, lfc_thr, pv_thr, color, valpha, geneid, genenames, gfont, dim, r, ar, dotsize, markerdot, sign_line, gstyle, show, figtype, axtickfontsize, axtickfontname, axlabelfontsize, axlabelfontname, axxlabel, axylabel, xlm, ylm, plotlegend, legendpos, figname, legendanchor, legendlabels, theme)

Parameters Description
df Pandas dataframe table having at least gene IDs, log fold change, and P-values or adjusted P-values columns
lfc Name of a column having log or absolute fold change values [string][default:logFC]
pv Name of a column having P-values or adjusted P-values [string][default:p_values]
lfc_thr Log fold change cutoff for up and downregulated genes [Tuple or list][default:(1.0, 1.0)]
pv_thr p value or adjusted p value cutoff for up and downregulated genes [Tuple or list][default:(0.05, 0.05)]
color Tuple of three colors [Tuple or list][default: color=("green", "grey", "red")]
valpha Transparency of points on volcano plot [float (between 0 and 1)][default: 1.0]
geneid Name of a column having gene Ids. This is necessary for plotting gene label on the points [string][default: None]
genenames Tuple of gene Ids to label the points. The gene Ids must be present in the geneid column. If this option set to "deg" it will label all genes defined by lfc_thr and pv_thr [string, tuple, dict][default: None]
gfont Font size for genenames [float][default: 10.0]. gfont not compatible with gstyle=2.
dim Figure size [Tuple of two floats (width, height) in inches][default: (5, 5)]
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
ar Rotation of X and Y-axis ticks labels [float][default: 90]
dotsize The size of the dots in the plot [float][default: 8]
markerdot Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: "o"]
sign_line Show grid lines on plot with defined log fold change (lfc_thr) and P-value (pv_thr) threshold value [True or False][default:False]
gstyle Style of the text for genenames. 1 for default text and 2 for box text [int][default: 1]
show Show the figure on console instead of saving in current folder [True or False][default:False]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: 'Arial']
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: 'Arial']
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
xlm Range of ticks to plot on X-axis [float (left, right, interval)][default: None]
ylm Range of ticks to plot on Y-axis [float (bottom, top, interval)][default: None]
plotlegend plot legend on volcano plot [True or False][default:False]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:"best"]
figname name of figure [string ][default:"volcano"]
legendanchor position of the legend outside of the plot. For more options see bbox_to_anchor parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [list][default:None]
legendlabels legend label names. If you provide custom label names keep the same order of label names as default [list][default:['significant up', 'not significant', 'significant down']]
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

Volcano plot image in same directory (volcano.png)

Working example
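
The lfc_thr and pv_thr cutoffs sort genes into the three groups that the color and legendlabels options refer to. The sketch below illustrates that classification in plain Python. It is not bioinfokit's code: the function name classify_gene is hypothetical, and it assumes the first element of each threshold tuple applies to upregulated genes and the second to downregulated genes, which this document does not spell out.

```python
def classify_gene(lfc, pv, lfc_thr=(1.0, 1.0), pv_thr=(0.05, 0.05)):
    """Label one gene from its log fold change and P-value (illustrative only)."""
    # Assumption: index 0 holds the cutoffs for upregulated genes,
    # index 1 holds the cutoffs for downregulated genes.
    if lfc >= lfc_thr[0] and pv < pv_thr[0]:
        return "significant up"
    if lfc <= -lfc_thr[1] and pv < pv_thr[1]:
        return "significant down"
    return "not significant"


# Hypothetical genes: name -> (log fold change, P-value)
genes = {"geneA": (2.3, 0.001), "geneB": (-1.8, 0.01), "geneC": (0.4, 0.2)}
labels = {name: classify_gene(lfc, pv) for name, (lfc, pv) in genes.items()}
print(labels)
```

A gene with a large fold change but a P-value above the cutoff stays in the "not significant" group, which is why both thresholds matter when choosing which points to label.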

Inverted Volcano plot

latest update v2.0.8

bioinfokit.visuz.GeneExpression.involcano(table, lfc, pv, lfc_thr, pv_thr, color, valpha, geneid, genenames, gfont, gstyle, dotsize, markerdot, r, dim, show, figtype, axxlabel, axylabel, axlabelfontsize, axtickfontsize, axtickfontname, plotlegend, legendpos, legendanchor, figname, legendlabels, ar, theme)

Parameters Description
table Pandas dataframe table having at least gene IDs, log fold change, and P-values or adjusted P-values
lfc Name of a column having log fold change values [default:logFC]
pv Name of a column having P-values or adjusted P-values [default:p_values]
lfc_thr Log fold change cutoff for up and downregulated genes [Tuple or list] [default:(1.0, 1.0)]
pv_thr p value or adjusted p value cutoff for up and downregulated genes [Tuple or list] [default:(0.05, 0.05)]
color Tuple of three colors [Tuple or list][default: color=("green", "grey", "red")]
valpha Transparency of points on volcano plot [float (between 0 and 1)][default: 1.0]
geneid Name of a column having gene Ids. This is necessary for plotting gene label on the points [string][default: None]
genenames Tuple of gene Ids to label the points. The gene Ids must be present in the geneid column. If this option set to "deg" it will label all genes defined by lfc_thr and pv_thr [string, Tuple, dict][default: None]
gfont Font size for genenames [float][default: 10.0]
gstyle Style of the text for genenames. 1 for default text and 2 for box text [int][default: 1]
dotsize The size of the dots in the plot [float][default: 8]
markerdot Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: "o"]
dim Figure size [Tuple of two floats (width, height) in inches][default: (5, 5)]
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
show Show the figure on console instead of saving in current folder [True or False][default:False]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
axlabelfontsize Font size for axis labels [float][default: 9]
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: 'Arial']
plotlegend plot legend on inverted volcano plot [True or False][default:False]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:"best"]
legendanchor position of the legend outside of the plot. For more options see bbox_to_anchor parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [list][default:None]
figname name of figure [string ][default:"involcano"]
legendlabels legend label names. If you provide custom label names keep the same order of label names as default [list][default:['significant up', 'not significant', 'significant down']]
ar Rotation of X and Y-axis ticks labels [float][default: 90]
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

Inverted volcano plot image in same directory (involcano.png)

Working example

MA plot

latest update v2.0.7

bioinfokit.visuz.GeneExpression.ma(df, lfc, ct_count, st_count, pv, basemean, lfc_thr, color, dim, dotsize, show, r, valpha, figtype, axxlabel, axylabel, axlabelfontsize, axtickfontsize, axtickfontname, xlm, ylm, fclines, fclinescolor, legendpos, legendanchor, figname, legendlabels, plotlegend, ar, theme, geneid, genenames, gfont, gstyle, title)

Parameters Description
df Pandas dataframe table having at least gene IDs, log fold change, and normalized counts (control and treatment) columns
lfc Name of a column having log fold change values [default:"logFC"]
ct_count Name of a column having count values for control sample. Ignored if basemean provided [default:"value1"]
st_count Name of a column having count values for treatment sample. Ignored if basemean provided [default:"value2"]
pv Name of a column having p values or adjusted p values
basemean Basemean (mean of normalized counts) from DESeq2 results
lfc_thr Log fold change cutoff for up and downregulated genes [Tuple or list][default:(1.0, 1.0)]
color Tuple of three colors [Tuple or list][default: ("green", "grey", "red")]
dotsize The size of the dots in the plot [float][default: 8]
markerdot Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: "o"]
valpha Transparency of points on plot [float (between 0 and 1)][default: 1.0]
dim Figure size [Tuple of two floats (width, height) in inches][default: (5, 5)]
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
show Show the figure on console instead of saving in current folder [True or False][default:False]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
axlabelfontsize Font size for axis labels [float][default: 9]
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: 'Arial']
xlm Range of ticks to plot on X-axis [float (left, right, interval)][default: None]
ylm Range of ticks to plot on Y-axis [float (bottom, top, interval)][default: None]
fclines Draw log fold change threshold lines as defined by lfc_thr [True or False][default:False]
fclinescolor color of fclines [string][default: '#2660a4']
plotlegend plot legend on MA plot [True or False][default:False]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:"best"]
legendanchor position of the legend outside of the plot. For more options see bbox_to_anchor parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [list][default:None]
figname name of figure [string ][default:"ma"]
legendlabels legend label names. If you provide custom label names keep the same order of label names as default [list][default:['significant up', 'not significant', 'significant down']]
ar Rotation of X and Y-axis ticks labels [float][default: 90]
theme Change background theme. If theme set to dark_background, the dark background will be produced instead of default white. See more themes here [string][default:'None']
geneid Name of a column having gene Ids. This is necessary for plotting gene label on the points [string][default: None]
genenames Tuple of gene Ids to label the points. The gene Ids must be present in the geneid column. If this option set to "deg" it will label all genes defined by lfc_thr and pv_thr [string, Tuple, dict][default: None]
gfont Font size for genenames [float][default: 10.0]
gstyle Style of the text for genenames. 1 for default text and 2 for box text [int][default: 1]
title Add main title to the plot [string][default: None]

Returns:

MA plot image in same directory (ma.png)

Working example
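
For orientation, an MA plot conventionally places the average log expression (A) on the X-axis and the log fold change (M) on the Y-axis. The sketch below computes both values for one gene from control and treatment counts. It follows the conventional definition and is not taken from bioinfokit; the function name ma_values is hypothetical, and counts are assumed to be strictly positive.

```python
import math


def ma_values(ct_count, st_count):
    """Return (A, M) for one gene from control and treatment counts."""
    # Assumption: both counts are > 0, so log2 is defined.
    log_ct = math.log2(ct_count)
    log_st = math.log2(st_count)
    a = (log_ct + log_st) / 2  # average log2 expression (X-axis)
    m = log_st - log_ct        # log2 fold change, treatment vs control (Y-axis)
    return a, m


a, m = ma_values(ct_count=100.0, st_count=400.0)
print(round(a, 3), round(m, 3))
```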

Heatmap

latest update v2.0.1

bioinfokit.visuz.gene_exp.hmap(table, cmap='seismic', scale=True, dim=(6, 8), rowclus=True, colclus=True, zscore=None, xlabel=True, ylabel=True, tickfont=(12, 12), show, r, figtype, figname, theme)

Parameters Description
file CSV delimited data file. It should not have NA or missing values
cmap Color Palette for heatmap [string][default: 'seismic']
scale Draw a color key with heatmap [boolean (True or False)][default: True]
dim heatmap figure size [Tuple of two floats (width, height) in inches][default: (6, 8)]
rowclus Draw hierarchical clustering for rows [boolean (True or False)][default: True]
colclus Draw hierarchical clustering for columns [boolean (True or False)][default: True]
zscore Z-score standardization of row (0) or column (1). It works when clus is True. [None, 0, 1][default: None]
xlabel Plot X-label [boolean (True or False)][default: True]
ylabel Plot Y-label [boolean (True or False)][default: True]
tickfont Fontsize for X and Y-axis tick labels [Tuple of two floats][default: (14, 14)]
show Show the figure on console instead of saving in current folder [True or False][default:False]
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
figname name of figure [string ][default:"heatmap"]
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

heatmap plot (heatmap.png, heatmap_clus.png)

Working example

Clustering analysis

Scree plot

latest update v2.0.1

bioinfokit.visuz.cluster.screeplot(obj, axlabelfontsize, axlabelfontname, axxlabel, axylabel, figtype, r, show, dim, theme)

Parameters Description
obj list of component name and component variance
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: 'Arial']
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
r Figure resolution in dpi [int][default: 300]
show Show the figure on console instead of saving in current folder [True or False][default:False]
dim Figure size [Tuple of two floats (width, height) in inches][default: (6, 4)]
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

Scree plot image (screeplot.png will be saved in same directory)

Working Example
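
A scree plot shows how much of the total variance each principal component explains. The sketch below turns per-component variances into proportions and pairs them with component names. The variances are made-up numbers, and the two-element layout of obj (names first, then variances) is an assumption based on the parameter description above.

```python
def explained_variance_ratio(variances):
    """Convert per-component variances into proportions of the total."""
    total = sum(variances)
    return [v / total for v in variances]


variances = [4.0, 2.0, 1.0, 1.0]  # hypothetical per-component variances
ratios = explained_variance_ratio(variances)
pc_names = ["PC" + str(i) for i in range(1, len(ratios) + 1)]

# Assumed layout for the obj argument: [component names, component variances]
obj = [pc_names, ratios]
print(obj)
```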

Principal component analysis (PCA) loadings plots

latest update v2.0.1

bioinfokit.visuz.cluster.pcaplot(x, y, z, labels, var1, var2, var3, axlabelfontsize, axlabelfontname, figtype, r, show, plotlabels, dim, theme)

Parameters Description
x loadings (correlation coefficient) for principal component 1 (PC1)
y loadings (correlation coefficient) for principal component 2 (PC2)
z loadings (correlation coefficient) for principal component 3 (PC3)
labels original variables labels from dataframe used for PCA
var1 Proportion of PC1 variance [float (0 to 1)]
var2 Proportion of PC2 variance [float (0 to 1)]
var3 Proportion of PC3 variance [float (0 to 1)]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: 'Arial']
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
r Figure resolution in dpi [int][default: 300]
show Show the figure on console instead of saving in current folder [True or False][default:False]
plotlabels Plot labels as defined by labels parameter [True or False][default:True]
dim Figure size [Tuple of two floats (width, height) in inches][default: (6, 4)]
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

PCA loadings plot 2D and 3D image (pcaplot_2d.png and pcaplot_3d.png will be saved in same directory)

Working Example

Principal component analysis (PCA) biplots

latest update v2.0.2

bioinfokit.visuz.cluster.biplot(cscore, loadings, labels, var1, var2, var3, axlabelfontsize, axlabelfontname, figtype, r, show, markerdot, dotsize, valphadot, colordot, arrowcolor, valphaarrow, arrowlinestyle, arrowlinewidth, centerlines, colorlist, legendpos, datapoints, dim, theme)

Parameters Description
cscore principal component scores (obtained from PCA().fit_transform() function in sklearn.decomposition)
loadings loadings (correlation coefficient) for principal components
labels original variables labels from dataframe used for PCA
var1 Proportion of PC1 variance [float (0 to 1)]
var2 Proportion of PC2 variance [float (0 to 1)]
var3 Proportion of PC3 variance [float (0 to 1)]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: 'Arial']
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
r Figure resolution in dpi [int][default: 300]
show Show the figure on console instead of saving in current folder [True or False][default:False]
markerdot Shape of the dot on plot. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: "o"]
dotsize The size of the dots in the plot [float][default: 6]
valphadot Transparency of dots on plot [float (between 0 and 1)][default: 1]
colordot Color of dots on plot [string or list ][default:"#4a4e4d"]
arrowcolor Color of the arrow [string ][default:"#fe8a71"]
valphaarrow Transparency of the arrow [float (between 0 and 1)][default: 1]
arrowlinestyle line style of the arrow. check more styles at https://matplotlib.org/3.1.0/gallery/lines_bars_and_markers/linestyles.html [string][default: '-']
arrowlinewidth line width of the arrow [float][default: 1.0]
centerlines draw center lines at x=0 and y=0 for 2D plot [bool (True or False)][default: True]
colorlist list of the categories to assign the color [list][default:None]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:"best"]
datapoints plot data points on graph [bool (True or False)][default: True]
dim Figure size [Tuple of two floats (width, height) in inches][default: (6, 4)]
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

PCA biplot 2D and 3D image (biplot_2d.png and biplot_3d.png will be saved in same directory)

Working Example

t-SNE plot

latest update v2.0.1

bioinfokit.visuz.cluster.tsneplot(score, colorlist, axlabelfontsize, axlabelfontname, figtype, r, show, markerdot, dotsize, valphadot, colordot, dim, figname, legendpos, legendanchor, theme)

Parameters Description
score t-SNE component embeddings (obtained from TSNE().fit_transform() function in sklearn.manifold)
colorlist list of the categories to assign the color [list][default:None]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: 'Arial']
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
r Figure resolution in dpi [int][default: 300]
show Show the figure on console instead of saving in current folder [True or False][default:False]
markerdot Shape of the dot on plot. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: "o"]
dotsize The size of the dots in the plot [float][default: 6]
valphadot Transparency of dots on plot [float (between 0 and 1)][default: 1]
colordot Color of dots on plot [string or list ][default:"#4a4e4d"]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:"best"]
legendanchor position of the legend outside of the plot. For more options see bbox_to_anchor parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [list][default:None]
dim Figure size [Tuple of two floats (width, height) in inches][default: (6, 4)]
figname name of figure [string ][default:"tsne_2d"]
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

t-SNE 2D image (tsne_2d.png will be saved in same directory)

Working Example

Normalization

RPM or CPM normalization

latest update v0.8.9

Normalize raw gene expression counts into Reads per million mapped reads (RPM) or Counts per million mapped reads (CPM)

bioinfokit.analys.norm.cpm(df)

Parameters Description
df Pandas dataframe containing raw gene expression values. Genes with missing expression values (NA) will be dropped.

Returns:

RPM or CPM normalized Pandas dataframe as class attributes (cpm_norm)

Working Example
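
As a reminder of the arithmetic, CPM divides each gene's count by the sample's total count and multiplies by one million. The sketch below applies that formula to one sample held in a plain dict. It illustrates the formula rather than bioinfokit's implementation; the function name and the counts are hypothetical.

```python
def cpm(counts):
    """Scale one sample's raw counts to counts per million."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


sample = {"geneA": 500, "geneB": 1500, "geneC": 8000}  # hypothetical raw counts
cpm_values = cpm(sample)
print(cpm_values)
```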

RPKM or FPKM normalization

latest update v0.9

Normalize raw gene expression counts into Reads per kilo base per million mapped reads (RPKM) or Fragments per kilo base per million mapped reads (FPKM)

bioinfokit.analys.norm.rpkm(df, gl)

Parameters Description
df Pandas dataframe containing raw gene expression values. Genes with missing expression or gene length values (NA) will be dropped.
gl Name of a column having gene length in bp [string][default: None]

Returns:

RPKM or FPKM normalized Pandas dataframe as class attributes (rpkm_norm)

Working Example
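
RPKM extends CPM by also dividing by gene length in kilobases, which is equivalent to count × 10^9 / (total reads × gene length in bp). The sketch below applies that formula to one sample. It is an illustration of the formula only; the function name, counts, and gene lengths are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads for one sample."""
    total = sum(counts.values())
    return {gene: counts[gene] * 1e9 / (total * lengths_bp[gene]) for gene in counts}


counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}      # hypothetical raw counts
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}  # hypothetical gene lengths
rpkm_values = rpkm(counts, lengths_bp)
print(rpkm_values)
```

Here geneA and geneB end up with the same RPKM even though geneB has three times the reads, because geneB is also three times longer.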

TPM normalization

latest update v0.9.1

Normalize raw gene expression counts into Transcripts per million (TPM)

bioinfokit.analys.norm.tpm(df, gl)

Parameters Description
df Pandas dataframe containing raw gene expression values. Genes with missing expression or gene length values (NA) will be dropped.
gl Name of a column having gene length in bp [string][default: None]

Returns:

TPM normalized Pandas dataframe as class attributes (tpm_norm)

Working Example
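
TPM first divides each count by gene length in kilobases, then rescales those rates so they sum to one million within the sample. The sketch below follows those two steps in plain Python. It illustrates the formula and is not bioinfokit's implementation; the function name and input values are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample."""
    # Step 1: reads per kilobase for each gene.
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Step 2: rescale so the rates sum to one million.
    rate_sum = sum(rates.values())
    return {gene: rate * 1_000_000 / rate_sum for gene, rate in rates.items()}


counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}      # hypothetical raw counts
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}  # hypothetical gene lengths
tpm_values = tpm(counts, lengths_bp)
print(tpm_values)
```

Because of the second step, TPM values always sum to one million per sample, which makes proportions comparable across samples.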

Variant analysis

Manhattan plot

latest update v2.0.1

bioinfokit.visuz.marker.mhat(df, chr, pv, log_scale, color, dim, r, ar, gwas_sign_line, gwasp, dotsize, markeridcol, markernames, gfont, valpha, show, figtype, axxlabel, axylabel, axlabelfontsize, ylm, gstyle, figname, theme)

Parameters Description
df Pandas dataframe object with at least SNP, chromosome, and P-values columns
chr Name of a column having chromosome numbers [string][default:None]
pv Name of a column having P-values. Must be numeric column [string][default:None]
log_scale Change the values provided in pv column to minus log10 scale. If set to False, the original values in pv will be used. This is useful in case of Fst values. [Boolean (True or False)][default:True]
color List the name of the colors to be plotted. It accepts either two alternating colors or as many colors as there are chromosomes. If nothing (None) is provided, a random color is assigned to each chromosome [list][default:None]
gwas_sign_line Plot statistical significant threshold line defined by option gwasp [Boolean (True or False)][default: False]
gwasp Statistical significant threshold to identify significant SNPs [float][default: 5E-08]
dotsize The size of the dots in the plot [float][default: 8]
markeridcol Name of a column having SNPs. This is necessary for plotting SNP names on the plot [string][default: None]
markernames The list of the SNPs to display on the plot. These SNPs should be present in the SNP column. It also accepts a dict of SNPs and their associated gene names. If this option is set to True, it will label all SNPs with a significant P-value as defined by gwasp [string, list, Tuple, dict][default: True]
gfont Font size for SNP names to display on the plot [float][default: 8]. gfont not compatible with gstyle=2.
valpha Transparency of points on plot [float (between 0 and 1)][default: 1.0]
dim Figure size [Tuple of two floats (width, height) in inches][default: (6, 4)]
r Figure resolution in dpi [int][default: 300]
ar Rotation of X-axis labels [float][default: 90]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
show Show the figure on console instead of saving in current folder [Boolean (True or False)][default:False]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
axlabelfontsize Font size for axis labels [float][default: 9]
ylm Range of ticks to plot on Y-axis [float Tuple (bottom, top, interval)][default: None]
gstyle Style of the text for markernames. 1 for default text and 2 for box text [int][default: 1]
figname name of figure [string][default:"manhattan"]
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

Manhattan plot image in same directory (manhattan.png)

Working example
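
The log_scale and gwasp options both come down to simple arithmetic on P-values. The sketch below converts P-values to the minus log10 scale and selects SNPs below a significance threshold. It is an illustration, not bioinfokit's code: the function names and SNP IDs are hypothetical, and treating the threshold as a strict "less than" comparison is an assumption.

```python
import math


def minus_log10(p_values):
    """Transform P-values to the -log10 scale used on the Y-axis."""
    return [-math.log10(p) for p in p_values]


def significant_snps(snps, gwasp=5e-08):
    """Return IDs of SNPs whose P-value is below the threshold (assumed strict)."""
    return [snp_id for snp_id, p in snps.items() if p < gwasp]


snps = {"rs1": 1e-9, "rs2": 0.03, "rs3": 4e-8}  # hypothetical SNP -> P-value
print(minus_log10(list(snps.values())))
print(significant_snps(snps))
```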

Variant annotation

latest update v0.9.3

Assign genetic features and function to the variants in VCF file

bioinfokit.analys.marker.vcf_anot(file, id, gff_file, anot_attr)

Parameters Description
file VCF file
id chromosome id column in VCF file [string][default='#CHROM']
gff_file GFF3 genome annotation file
anot_attr Gene function tag in attributes field of GFF3 file

Returns:

Tab-delimited text file with annotation (annotated text file will be saved in same directory)

Working Example

Concatenate VCF files

latest update v0.9.4

Concatenate multiple VCF files into single VCF file (for example, VCF files for each chromosome)

bioinfokit.analys.marker.concatvcf(file)

Parameters Description
file Multiple VCF files separated by commas

Returns:

Concatenated VCF file (concat_vcf.vcf)

Working example
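
Conceptually, concatenating per-chromosome VCF files means keeping one copy of the header and appending the variant records from every file. The sketch below shows that idea on lists of lines rather than on files. It is not bioinfokit's implementation; the function name is hypothetical, and it assumes all inputs share the same header and sample columns.

```python
def concat_vcf_lines(vcfs):
    """Merge several VCFs (each a list of lines), keeping the first header only."""
    merged = []
    for index, lines in enumerate(vcfs):
        for line in lines:
            # Header lines start with '#'; skip them for every file after the first.
            if line.startswith("#") and index > 0:
                continue
            merged.append(line)
    return merged


header = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT"]
chr1 = header + ["chr1\t100\t.\tA\tG"]  # hypothetical records
chr2 = header + ["chr2\t200\t.\tC\tT"]
merged = concat_vcf_lines([chr1, chr2])
print("\n".join(merged))
```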

Split VCF file

bioinfokit.analys.marker.splitvcf(file, id)

Split a single VCF file containing variants for all chromosomes into individual files, one per chromosome

Parameters Description
file VCF file to split
id chromosome id column in VCF file [string][default='#CHROM']

Returns:

VCF files for each chromosome

Working example
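
Splitting by chromosome amounts to grouping records on their first column and repeating the header in each group. The sketch below demonstrates this on a list of lines. It is an illustration under stated assumptions, not bioinfokit's code: the function name is hypothetical, and the chromosome is assumed to be the first tab-separated field of each record.

```python
from collections import defaultdict


def split_vcf_lines(lines):
    """Group VCF records by chromosome; each group gets a copy of the header."""
    header = [line for line in lines if line.startswith("#")]
    records = defaultdict(list)
    for line in lines:
        if not line.startswith("#"):
            chrom = line.split("\t", 1)[0]  # first column holds the chromosome
            records[chrom].append(line)
    return {chrom: header + recs for chrom, recs in records.items()}


vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",  # hypothetical records
    "chr2\t200\t.\tC\tT",
    "chr1\t300\t.\tG\tA",
]
per_chrom = split_vcf_lines(vcf)
print(sorted(per_chrom))
```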

High-throughput sequence analysis

FASTQ batch downloads from SRA database

latest update v0.9.7

bioinfokit.analys.fastq.sra_bd(file, t, other_opts)

FASTQ files will be downloaded using fasterq-dump. Make sure the latest version of the NCBI SRA toolkit (version 2.10.8) is installed and its binaries are added to the system path.

Parameters Description
file List of SRA accessions for batch download. Accessions must be separated by newlines in the file.
t Number of threads for parallel run [int][default=4]
other_opts Provide other relevant options for fasterq-dump [str][default=None]
Provide the options as a space-separated string. You can get a detailed option for fasterq-dump using the -help option.

Returns:

FASTQ files for each SRA accession in the current directory unless specified by other_opts

Description and working example

FASTQ quality format detection

bioinfokit.analys.format.fq_qual_var(file)

Parameters Description
file FASTQ file to detect quality format [default: None]

Returns:

Quality format encoding name for FASTQ file (Supports only Sanger, Illumina 1.8+ and Illumina 1.3/1.4)

Working Example
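
Quality encodings differ mainly in their ASCII offset: Sanger and Illumina 1.8+ use Phred+33, while Illumina 1.3/1.4 uses Phred+64. The sketch below is a common heuristic that guesses the offset from the lowest quality character seen. It is not bioinfokit's detection logic; the function name is hypothetical, and it does not try to tell Sanger apart from Illumina 1.8+.

```python
def guess_phred_offset(quality_lines):
    """Guess the Phred offset (33 or 64) from FASTQ quality strings."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    lowest = min(codes)
    if lowest < 59:
        return 33  # characters this low only occur in Phred+33 encodings
    if lowest >= 64:
        return 64  # consistent with Phred+64 (heuristic, not proof)
    return None    # ambiguous range; more reads are needed to decide


print(guess_phred_offset(["IIII#5<<"]))
print(guess_phred_offset(["hhhhdddd"]))
```

Scanning more reads makes the guess more reliable, because a file with uniformly high qualities can look like Phred+64 when only a few records are inspected.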

Sequencing coverage

latest update v0.9.7

bioinfokit.analys.fastq.seqcov(file, gs)

Parameters Description
file FASTQ file
gs Genome size in Mbp

Returns:

Sequencing coverage of the given FASTQ file

Description and Working example
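
Sequencing coverage is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below applies that estimate to FASTQ lines held in memory, with the genome size given in Mbp as in the gs parameter. It is an illustration, not bioinfokit's code; the function name is hypothetical, and a strict four-line-per-record FASTQ layout is assumed.

```python
def sequencing_coverage(fastq_lines, gs_mbp):
    """Estimate coverage as total bases / genome size (genome size in Mbp)."""
    # In a four-line FASTQ record, the sequence is the second line.
    total_bases = sum(
        len(line.strip()) for i, line in enumerate(fastq_lines) if i % 4 == 1
    )
    return total_bases / (gs_mbp * 1_000_000)


fastq = [
    "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",  # hypothetical reads
    "@read2", "ACGTA", "+", "IIIII",
]
coverage = sequencing_coverage(fastq, gs_mbp=0.000005)  # toy 5 bp "genome"
print(round(coverage, 2))
```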

Split the sequence into smaller subsequences

latest update v2.0.6

bioinfokit.analys.Fasta.split_seq(seq, seq_size, seq_overlap, any_cond, outfmt)

Parameters Description
seq Input sequence [string]
seq_size subsequence size [int][default: 3]
seq_overlap Split the sequence in overlap mode [bool][default: True]
any_cond Split sequence based on a condition. Not yet defined.
outfmt Output format for the subsequences. If parameter set to 'fasta', the file will be saved in same folder with name output_chunks.fasta ['list' or 'fasta'][default: 'list']

Returns:

Subsequences in list or fasta file (output_chunks.fasta) format

Description and Working example
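
The difference between overlap and non-overlap mode can be pictured as a window of seq_size bases moving along the sequence. The sketch below assumes that overlap mode moves the window one base at a time, that non-overlap mode moves it by seq_size, and that a trailing partial chunk is dropped. These are assumptions for illustration; this document does not state how bioinfokit handles these details, and the function name is hypothetical.

```python
def sliding_chunks(seq, seq_size=3, seq_overlap=True):
    """Split seq into fixed-size subsequences (illustrative only)."""
    # Assumption: overlap mode steps by 1 base, non-overlap mode by seq_size.
    step = 1 if seq_overlap else seq_size
    last_start = len(seq) - seq_size
    return [seq[i:i + seq_size] for i in range(0, last_start + 1, step)]


print(sliding_chunks("ATGCGTA"))
print(sliding_chunks("ATGCGTA", seq_overlap=False))
```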

Reverse complement of DNA sequence

latest update v2.1.1

bioinfokit.analys.Fasta.rev_com(seq, file)

Parameters Description
seq DNA sequence to perform reverse complement
file DNA sequence in a fasta file

Returns:

Reverse complement of original DNA sequence

Working example
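
The reverse complement is obtained by replacing each base with its complement (A with T, C with G) and reading the result backwards. The sketch below does this with a translation table in plain Python. It is an illustration, not bioinfokit's implementation; the function name is hypothetical, and only A, C, G, T, and N in either case are handled.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCCG"))
```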

File format conversions

bioinfokit.analys.format

Function Parameters Description
bioinfokit.analys.format.fqtofa(file) FASTQ file Convert FASTQ file into FASTA format
bioinfokit.analys.format.hmmtocsv(file) HMM file Convert HMM text output (from HMMER tool) to CSV format
bioinfokit.analys.format.tabtocsv(file) TAB file Convert TAB file to CSV format
bioinfokit.analys.format.csvtotab(file) CSV file Convert CSV file to TAB format

Returns:

Output will be saved in same directory

Working example
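
For the TAB to CSV case, the main subtlety is that a field containing a comma must be quoted in the CSV output. The sketch below shows the conversion on an in-memory string with Python's csv module. It illustrates the idea and is not bioinfokit's code; the function name and sample data are hypothetical.

```python
import csv
import io


def tab_to_csv(tab_text):
    """Convert tab-delimited text to CSV text, quoting fields where needed."""
    reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


tab_text = "gene\tdescription\ngeneA\tkinase, putative\n"  # hypothetical table
print(tab_to_csv(tab_text))
```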

GFF3 to GTF file format conversion

latest update v1.0.1

bioinfokit.analys.gff.gff_to_gtf(file, trn_feature_name)

Parameters Description
file GFF3 genome annotation file
trn_feature_name Name of the feature (column 3 of GFF3 file) of RNA transcripts if other than 'mRNA' or 'transcript'

Returns:

GTF format genome annotation file (file.gtf will be saved in same directory)

Working Example

Bioinformatics file readers and processing (FASTA, FASTQ, and VCF)

latest update v2.0.4

Function Parameters Description
bioinfokit.analys.Fasta.fasta_reader(file) FASTA file FASTA file reader
bioinfokit.analys.fastq.fastq_reader(file) FASTQ file FASTQ file reader
bioinfokit.analys.marker.vcfreader(file) VCF file VCF file reader

Returns:

File generator object (can be iterated only once) that can be parsed for the record

Description and working example
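
The note that a generator can be iterated only once is easy to overlook. The sketch below is a minimal FASTA record generator in plain Python that demonstrates this behavior. It is not bioinfokit's reader: the function name is hypothetical, and it yields (header, sequence) tuples, which is an assumption about a convenient record shape.

```python
def fasta_records(lines):
    """Yield (header, sequence) tuples from FASTA lines (illustrative only)."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


fasta = [">seq1", "ACGT", "TTGA", ">seq2", "GGCC"]  # hypothetical FASTA content
records = fasta_records(fasta)
first_pass = list(records)   # consumes the generator
second_pass = list(records)  # already exhausted, so this is empty
print(first_pass, second_pass)
```

To traverse the records again, create a new generator or store the first pass in a list.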

Extract subsequence from FASTA files

latest update v2.0.4

bioinfokit.analys.Fasta.ext_subseq(file, id, st, end, strand)

Extract the subsequence of a specified region from a FASTA file. If the target region is on the minus strand, the reverse complement of the subsequence will be printed.

Parameters Description
file FASTA file [file]
id The ID of sequence from FASTA file to extract the subsequence [string]
st Start coordinate of the subsequence [int]
end End coordinate of the subsequence [int]
strand Strand of the subsequence ['plus' or 'minus'][default: 'plus']

Returns:

Subsequence to stdout

Extract sequences from FASTA file

latest update v2.1.3

bioinfokit.analys.Fasta.extract_seq(file, id)

Extract sequences from a FASTA file based on a list of sequence IDs provided in another file

Parameters Description
file FASTA file [file]
id List of sequence IDs separated by new line. This file can also contain the ID, start and end coordinates separated by TAB [file]

Returns:

Sequences extracted from FASTA file based on the given IDs provided in id file. Output FASTA file will be saved as output.fasta in current working directory.

Description and working example

Split FASTA file into multiple FASTA files

latest update v2.0.4

bioinfokit.analys.Fasta.split_fasta(file, n, bases_per_line)

Split one big FASTA file into multiple smaller FASTA files

Parameters Description
file FASTA file [file]
n Number of FASTA files to split the big FASTA file [int][default: 2]
bases_per_line Number of bases per line for output FASTA files [int][default: 60]

Returns:

Number of smaller FASTA files with prefix output (output_0.fasta, output_1.fasta and so on)
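A minimal sketch of the two ideas behind splitting, distributing records over `n` outputs and wrapping each sequence at `bases_per_line`, is shown below. Both helpers are hypothetical, and the round-robin distribution is an assumption; the text above does not say how bioinfokit assigns records to the output files.

```python
def split_records(records, n=2):
    """Distribute (header, seq) records into n lists round-robin (assumed scheme)."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

def format_record(header, seq, bases_per_line=60):
    """Format one FASTA record with the sequence wrapped at bases_per_line."""
    lines = [seq[i:i + bases_per_line] for i in range(0, len(seq), bases_per_line)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"

records = [("s1", "ATGCATGCAT"), ("s2", "GG"), ("s3", "TTTT")]
parts = split_records(records, n=2)
wrapped = format_record("s1", "ATGCATGCAT", bases_per_line=4)
```

Each list in `parts` would then be written to its own file (for example output_0.fasta, output_1.fasta) using `format_record`.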

Convert multi-line FASTA into single-line FASTA

latest update v2.1.2

bioinfokit.analys.Fasta.multi_to_single_line(file)

Convert multi-line FASTA (where each sequence spans multiple lines) into single-line FASTA (where each sequence is on a single line)

Parameters Description
file FASTA file [file]

Returns:

Single-line FASTA file, saved as output.fasta in the current working directory.

Description and working example

Merge counts files from featureCounts

latest update v2.0.5

bioinfokit.analys.HtsAna.merge_featureCount(pattern, gene_column_name)

Merge counts files generated from featureCounts when it runs individually on large samples. The count files must be in same folder and should end with .txt file extension.

Parameters Description
pattern file name pattern for each count file [default: '*.txt']
gene_column_name gene id column name for feature and meta-features [default: 'Geneid']

Returns:

Merged count file (gene_matrix_count.csv) saved in the same folder

Split BED file by chromosome

latest update v2.0.9

bioinfokit.analys.HtsAna.split_bed(bed)

Split the BED file by chromosome names

Parameters Description
bed Input BED file [default: None]

Returns:

BED file for each chromosome (files will be saved in same directory)

Working example
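Splitting by chromosome amounts to grouping BED lines on their first tab-separated column. The sketch below shows that grouping on in-memory lines; `split_bed_lines` is a hypothetical helper, not the bioinfokit function, and skipping `track`/`browser`/comment lines is an assumption about the input.

```python
from collections import defaultdict

def split_bed_lines(lines):
    """Group BED lines by chromosome name (column 1). Hypothetical helper."""
    by_chrom = defaultdict(list)
    for line in lines:
        # Skip blank lines and header-like lines (assumed not to be wanted).
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split("\t", 1)[0]
        by_chrom[chrom].append(line.rstrip("\n"))
    return dict(by_chrom)

bed = ["chr1\t100\t200\n", "chr2\t50\t80\n", "chr1\t300\t400\n"]
groups = split_bed_lines(bed)
```

Writing each value of `groups` to a file named after its key gives one BED file per chromosome.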

Max and Min sequence lengths from Fasta

latest update v2.1.4

bioinfokit.analys.Fasta.max_min_len(fasta)

Find Max and Min sequence lengths from Fasta

Parameters Description
fasta Input Fasta file [default: None]

Returns:

Max and Min sequence lengths from Fasta file

Working example

Functional enrichment analysis

Gene family enrichment analysis (GenFam)

latest update v1.0.0

bioinfokit.analys.genfam.fam_enrich(id_file, species, id_type, stat_sign_test, multi_test_corr, min_map_ids, alpha)

GenFam is a comprehensive classification and enrichment analysis tool for plant genomes. It provides a unique way to characterize the large-scale gene datasets such as those from transcriptome analysis (read GenFam paper for more details)

Parameters Description
id_file Text file containing the list of gene IDs to analyze using GenFam. IDs must be separated by newline.
species Plant species ID for GenFam analysis. All plant species ID provided here
id_type Plant species ID type
1: Phytozome locus ID
2: Phytozome transcript ID
3: Phytozome PAC ID
stat_sign_test Statistical significance test for enrichment analysis [default=1].
1: Fisher exact test
2: Hypergeometric distribution
3: Binomial distribution
4: Chi-squared distribution
multi_test_corr Multiple testing correction test [default=3].
1: Bonferroni
2: Bonferroni-Holm
3: Benjamini-Hochberg
min_map_ids Minimum number of gene IDs from the user list (id_file) that must be mapped to the background database for GenFam analysis to be performed [default=5]
alpha Significance level [float][default: 0.05]

Returns:

Attribute Description
df_enrich Enriched gene families with p < 0.05
genfam_info GenFam run information
Output files Output figures and files from GenFam analysis
genfam_enrich.png: GenFam figure for enriched gene families
fam_enrich_out.txt: List of enriched gene families with mapped gene IDs, GO annotation, and detailed statistics
fam_all_out.txt: List of all gene families with mapped gene IDs, GO annotation, and detailed statistics

Description and working example
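To make two of the options above concrete, here is a plain-Python sketch of a hypergeometric enrichment p-value (stat_sign_test option 2) and the Benjamini-Hochberg adjustment (multi_test_corr option 3). The function names and the example numbers are hypothetical, and this is not GenFam's implementation; it only shows the standard formulas.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of k or more family members among n query genes,
    given K family members in a background of N genes."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)) / total

def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

p_family = hypergeom_sf(k=1, N=10, K=3, n=2)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A family would be reported as enriched when its adjusted p-value falls below `alpha`.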

Check allowed ID types for plant species for GenFam

latest update v1.0.0

bioinfokit.analys.genfam.check_allowed_ids(species)

Parameters Description
species Plant species ID to check for allowed ID type. All plant species ID provided here

Returns:

Allowed ID types for GenFam

Description and working example

Biostatistical analysis

Correlation matrix plot

latest update v2.0.1

bioinfokit.visuz.stat.corr_mat(table, corm, cmap, r, dim, show, figtype, axtickfontsize, axtickfontname, theme)

Parameters Description
table Dataframe object with numerical variables (columns) to find correlation. Ideally, you should have three or more variables. Dataframe should not have identifier column.
corm Correlation method [pearson,kendall,spearman] [default:pearson]
cmap Color Palette for heatmap [string][default: 'seismic']. More colormaps are available at https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
dim Figure size [Tuple of two floats (width, height) in inches][default: (6, 5)]
show Show the figure on console instead of saving in current folder [True or False][default:False]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
axtickfontsize Font size for axis ticks [float][default: 7]
axtickfontname Font name for axis ticks [string][default: 'Arial']
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

Correlation matrix plot image in same directory (corr_mat.png)

Working example
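For reference, the matrix that the heatmap visualizes can be computed by hand. The sketch below implements the Pearson coefficient (the default `corm`) for every pair of columns; the helper names and the toy data are hypothetical, and pandas' `DataFrame.corr` would normally be used instead.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corr_matrix(columns):
    """Return a nested dict {col_a: {col_b: r}} for all column pairs."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

data = {"A": [1.0, 2.0, 3.0, 4.0], "B": [2.0, 4.0, 6.0, 8.0], "C": [4.0, 3.0, 2.0, 1.0]}
matrix = corr_matrix(data)
```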

Bar-dot plot

latest update v0.8.5

bioinfokit.visuz.stat.bardot(df, colorbar, colordot, bw, dim, r, ar, hbsize, errorbar, dotsize, markerdot, valphabar, valphadot, show, figtype, axxlabel, axylabel, axlabelfontsize, axlabelfontname, ylm, axtickfontsize, axtickfontname, yerrlw, yerrcw)

Parameters Description
df Pandas dataframe object
colorbar Color of bar graph [string or list][default:"#bbcfff"]
colordot Color of dots on bar [string or list][default:"#ee8972"]
bw Width of bar [float][default: 0.4]
dim Figure size [Tuple of two floats (width, height) in inches][default: (6, 4)]
r Figure resolution in dpi [int][default: 300]
ar Rotation of X-axis labels [float][default: 0]
hbsize Horizontal bar size for standard error bars [float][default: 4]
errorbar Draw standard error bars [bool (True or False)][default: True]
dotsize The size of the dots in the plot [float][default: 6]
markerdot Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: "o"]
valphabar Transparency of bars on plot [float (between 0 and 1)][default: 1]
valphadot Transparency of dots on plot [float (between 0 and 1)][default: 1]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
show Show the figure on console instead of saving in current folder [True or False][default:False]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: 'Arial']
ylm Range of ticks to plot on Y-axis [float Tuple (bottom, top, interval)][default: None]
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: 'Arial']
yerrlw Error bar line width [float][default: None]
yerrcw Error bar cap width [float][default: None]

Returns:

Bar-dot plot image in same directory (bardot.png)

Working Example

One sample and two sample Z-tests

latest update v2.1.0

bioinfokit.analys.stat.ztest(df, x, y, mu, x_std, y_std, alpha, test_type)

Parameters Description
df Pandas dataframe for appropriate Z-test.
One sample: It should have at least one variable
Two sample independent: It should have at least two variables
x column name for x group [string][default: None]
y column name for y group [string][default: None]
mu Population or known mean for the one sample Z-test [float][default: None]
x_std Population standard deviation for x group [float][default: None]
y_std Population standard deviation for y group [float][default: None]
alpha Significance level for confidence interval (CI). If alpha=0.05, then 95% CI will be calculated [float][default: 0.05]
test_type Type of Z-test [int (1,2)][default: None].
1: One sample Z-test
2: Two sample Z-test

Returns:

Summary output as class attribute (summary and result)

Description and Working example
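The two test types can be written out with the standard library to show how `mu`, `x_std` and `y_std` enter the statistic. This is a sketch of the textbook formulas with hypothetical helper names and toy data, not bioinfokit's code; the two-sided p-value is an assumption about the alternative hypothesis.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z(values, mu, sigma):
    """Return (z, two-sided p) for H0: the population mean equals mu."""
    z = (mean(values) - mu) / (sigma / sqrt(len(values)))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

def two_sample_z(x, y, x_std, y_std):
    """Return (z, two-sided p) for H0: both groups have the same mean."""
    se = sqrt(x_std ** 2 / len(x) + y_std ** 2 / len(y))
    z = (mean(x) - mean(y)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z1, p1 = one_sample_z([5, 6, 7, 8], mu=6, sigma=2)
z2, p2 = two_sample_z([5, 6, 7, 8], [4, 5, 6, 7], x_std=2, y_std=2)
```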

One sample and two sample (independent and paired) t-tests

latest update v2.1.0

bioinfokit.analys.stat.ttest(df, xfac, res, evar, alpha, test_type, mu)

Parameters Description
df Pandas dataframe for appropriate t-test.
One sample: It should have at least the dependent (res) variable
Two sample independent: It should have independent (xfac) and dependent (res) variables
Two sample paired: It should have two dependent (res) variables
xfac Independent group column name with two levels [string][default: None]
res Dependent variable column name [string or list or Tuple][default: None]
evar t-test with equal variance [bool (True or False)][default: True]
alpha Significance level for confidence interval (CI). If alpha=0.05, then 95% CI will be calculated [float][default: 0.05]
test_type Type of t-test [int (1,2,3)][default: None].
1: One sample t-test
2: Two sample independent t-test
3: Two sample paired t-test
mu Population or known mean for the one sample t-test [float][default: None]

Returns:

Summary output as class attribute (summary and result)

Description and Working example

Chi-square test

latest update v0.9.5

bioinfokit.analys.stat.chisq(df, p)

Parameters Description
df Pandas dataframe. It should be one or two-dimensional contingency table.
p Theoretical expected probabilities for each group. They must be non-negative and sum to 1. If p is provided, a goodness-of-fit test will be performed [list or Tuple][default: None]

Returns:

Summary and expected counts as class attributes (summary and expected_df)

Working example
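The expected counts reported for a two-dimensional table follow from the row and column totals. The sketch below computes them and the Pearson chi-square statistic without a continuity correction; whether bioinfokit applies a correction is not stated above, and the helper names and the table are hypothetical.

```python
def expected_counts(table):
    """Expected counts under independence: row_total * col_total / grand_total."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    return [[r * c / total for c in col_tot] for r in row_tot]

def chi_square_stat(table):
    """Pearson chi-square statistic (no continuity correction)."""
    exp = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )

observed = [[10, 20], [30, 40]]
expected = expected_counts(observed)
chi2 = chi_square_stat(observed)
```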

Linear regression analysis

bioinfokit.visuz.stat.lin_reg(df, x, y)

Parameters Description
df Pandas dataframe object
x Name of column having independent X variables [list][default:None]
y Name of column having dependent Y variables [list][default:None]

Returns:

Regression analysis summary

Working Example
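For a single independent variable, the quantities in a regression summary reduce to a few sums. The sketch below fits a line by ordinary least squares and reports R-squared; `fit_line` is a hypothetical helper on toy data and does not reproduce the full summary or the multiple-regression case that `lin_reg` accepts.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R-squared: 1 - residual sum of squares / total sum of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

slope, intercept, r2 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```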

Regression plot

latest update v2.0.1

bioinfokit.visuz.stat.regplot(df, x, y, yhat, dim, colordot, colorline, r, ar, dotsize, markerdot, linewidth, valphaline, valphadot, show, figtype, axxlabel, axylabel, axlabelfontsize, axlabelfontname, xlm, ylm, axtickfontsize, axtickfontname, theme)

Parameters Description
df Pandas dataframe object
x Name of column having independent X variables [string][default:None]
y Name of column having dependent Y variables [string][default:None]
yhat Name of column having predicted response of Y variable (y_hat) from regression [string][default:None]
dim Figure size [Tuple of two floats (width, height) in inches][default: (6, 4)]
r Figure resolution in dpi [int][default: 300]
ar Rotation of X-axis labels [float][default: 0]
dotsize The size of the dots in the plot [float][default: 6]
markerdot Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: "o"]
valphaline Transparency of regression line on plot [float (between 0 and 1)][default: 1]
valphadot Transparency of dots on plot [float (between 0 and 1)][default: 1]
linewidth Width of regression line [float][default: 1]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
show Show the figure on console instead of saving in current folder [True or False][default:False]
axxlabel Label for X-axis. If you provide this option, default label will be replaced [string][default: None]
axylabel Label for Y-axis. If you provide this option, default label will be replaced [string][default: None]
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: 'Arial']
xlm Range of ticks to plot on X-axis [float Tuple (bottom, top, interval)][default: None]
ylm Range of ticks to plot on Y-axis [float Tuple (bottom, top, interval)][default: None]
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: 'Arial']
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

Regression plot image in same directory (reg_plot.png)

Working Example

Tukey HSD test

latest update v1.0.3

bioinfokit.analys.stat.tukey_hsd(df, res_var, xfac_var, anova_model, phalpha, ss_typ)

It performs multiple pairwise comparisons of treatment groups using Tukey's HSD (Honestly Significant Difference) test to check if group means are significantly different from each other. It uses the Tukey-Kramer approach if the sample sizes are unequal among the groups.

Parameters Description
df Pandas dataframe with the variables mentioned in the res_var, xfac_var and anova_model options. It should not have missing data. The missing data will be omitted.
res_var Name of a column having response variable [string][default: None]
xfac_var Name of a column having factor or group for pairwise comparison [string][default: None]
anova_model ANOVA model (calculated using statsmodels ols function) [string][default: None]
phalpha Significance level [float][default: 0.05]
ss_typ Type of sum of square to perform ANOVA [int][default: 2]

Returns:

Attribute Description
tukey_summary Pairwise comparisons for main and interaction effects by Tukey HSD test

Description and Working example

Bartlett's test

latest update v1.0.3

bioinfokit.analys.stat.bartlett(df, xfac_var, res_var)

It performs Bartlett's test to check the homogeneity of variances among the treatment groups. It accepts the input table in a stacked format. More details https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.bartlett.html

Parameters Description
df Pandas dataframe containing response (res_var) and independent variables (xfac_var) in a stacked format. It should not have missing data. The missing data will be omitted.
res_var Name of a column having response variable [string][default: None]
xfac_var Name of a column having treatment groups (independent variables) [string or list][default: None]

Returns:

Attribute Description
bartlett_summary Pandas dataframe containing Bartlett's test statistics, degree of freedom, and p value

Description and Working example

Levene's test

latest update v1.0.3

bioinfokit.analys.stat.levene(df, xfac_var, res_var, center)

It performs Levene's test to check the homogeneity of variances among the treatment groups. It accepts the input table in a stacked format. More details https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.levene.html

Parameters Description
df Pandas dataframe containing response (res_var) and independent variables (xfac_var) in a stacked format. It should not have missing data. The missing data will be omitted.
res_var Name of a column having response variable [string][default: None]
xfac_var Name of a column having treatment groups (independent variables) [string or list][default: None]
center Choice for the Levene's test [string (median, mean, trimmed)] [default: median]
median: Brown-Forsythe Levene-type test
mean: original Levene's test
trimmed: Brown-Forsythe Levene-type test

Returns:

Attribute Description
levene_summary Pandas dataframe containing Levene's test statistics, degree of freedom, and p value

Description and Working example
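The effect of the `center` option can be seen by writing the test statistic out. The sketch below computes Levene's W from the absolute deviations around each group's median or mean; it is a hypothetical helper that omits the trimmed-mean variant and the p-value (which needs the F distribution), and it is not bioinfokit's or SciPy's code.

```python
from statistics import mean, median

def levene_w(groups, center="median"):
    """Levene's W statistic for a list of groups, using the chosen center."""
    center_fn = {"median": median, "mean": mean}[center]
    # Absolute deviations of each value from its own group's center.
    z = []
    for g in groups:
        c = center_fn(g)
        z.append([abs(v - c) for v in g])
    n_total = sum(len(g) for g in z)
    k = len(z)
    z_means = [mean(g) for g in z]
    grand = sum(sum(g) for g in z) / n_total
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(z, z_means))
    within = sum((v - m) ** 2 for g, m in zip(z, z_means) for v in g)
    return (n_total - k) / (k - 1) * between / within

groups = [[1, 2, 3], [2, 4, 6]]
w_median = levene_w(groups, center="median")
w_mean = levene_w(groups, center="mean")
```

In this toy data each group's mean equals its median, so both choices give the same W; on skewed data they generally differ.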

ROC plot

latest update v2.0.1

bioinfokit.visuz.stat.roc(fpr, tpr, c_line_style, c_line_color, c_line_width, diag_line, diag_line_style, diag_line_width, diag_line_color, auc, shade_auc, shade_auc_color, axxlabel, axylabel, axtickfontsize, axtickfontname, axlabelfontsize, axlabelfontname, plotlegend, legendpos, legendanchor, legendcols, legendfontsize, legendlabelframe, legend_columnspacing, dim, show, figtype, figname, r, ylm, theme)

Receiver operating characteristic (ROC) curve for visualizing classification performance

Parameters Description
fpr Increasing false positive rates obtained from sklearn.metrics.roc_curve [list][default:None]
tpr Increasing true positive rates obtained from sklearn.metrics.roc_curve [list][default:None]
c_line_style Line style for ROC curve [string][default:'-']
c_line_color Line color for ROC curve [string][default:'#f05f21']
c_line_width Line width for ROC curve [float][default:1]
diag_line Plot reference line [True or False][default: True]
diag_line_style Line style for reference line [string][default:'--']
diag_line_width Line width for reference line [float][default:1]
diag_line_color Line color for reference line [string][default:'b']
auc Area under ROC. It can be obtained from sklearn.metrics.roc_auc_score [float][default: None]
shade_auc Shade area for AUC [True or False][default: False]
shade_auc_color Shade color for AUC [string][default: '#f48d60']
axxlabel Label for X-axis [string][default: 'False Positive Rate (1 - Specificity)']
axylabel Label for Y-axis [string][default: 'True Positive Rate (Sensitivity)']
axtickfontsize Font size for axis ticks [float][default: 9]
axtickfontname Font name for axis ticks [string][default: 'Arial']
axlabelfontsize Font size for axis labels [float][default: 9]
axlabelfontname Font name for axis labels [string][default: 'Arial']
plotlegend plot legend [True or False][default:True]
legendpos position of the legend on plot. For more options see loc parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [string ][default:'lower right']
legendanchor position of the legend outside of the plot. For more options see bbox_to_anchor parameter at https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html [list][default:None]
legendcols Number of columns for legends [int][default: 1]
legendfontsize Font size for the legends [float][default:8]
legendlabelframe Box frame for the legend [True or False][default: False]
legend_columnspacing Spacing between the legends [float][default: None]
dim Figure size [Tuple of two floats (width, height) in inches][default: (5, 4)]
show Show the figure on console instead of saving in current folder [True or False][default:False]
figtype Format of figure to save. Supported format are eps, pdf, pgf, png, ps, raw, rgba, svg, svgz [string][default:'png']
figname name of figure [string ][default:'roc']
r Figure resolution in dpi [int][default: 300]. Not compatible with show= True
ylm Range of ticks to plot on Y-axis [float (bottom, top, interval)][default: None]
theme Change background theme. If theme set to dark, the dark background will be produced instead of white [string][default:'None']

Returns:

ROC plot image in same directory (roc.png)

Working example
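The `auc` value is the area under the curve traced by the `fpr` and `tpr` lists. The text above obtains these from scikit-learn; the sketch below is a pure-Python trapezoidal approximation on made-up points, included only to show what that number represents. `trapezoid_auc` is a hypothetical helper.

```python
def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule (fpr must be non-decreasing)."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        area += width * (tpr[i] + tpr[i - 1]) / 2
    return area

# Made-up ROC points for illustration.
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 1.0, 1.0]
auc = trapezoid_auc(fpr, tpr)
```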

Regression metrics

Calculate Root Mean Square Error (RMSE), Mean squared error (MSE), Mean absolute error (MAE), and Mean absolute percent error (MAPE) from regression fit

latest update v1.0.8

bioinfokit.analys.stat.reg_metric(y, yhat, resid)

Parameters Description
y Original values for dependent variable [numpy array] [default: None]
yhat Predicted values from regression [numpy array] [default: None]
resid Regression residuals [numpy array][default: None]

Returns:

Pandas dataframe with values for RMSE, MSE, MAE, and MAPE

Working example
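The four metrics have short closed forms, sketched below with the standard library. `reg_metrics` is a hypothetical helper; MAPE is returned here as a fraction rather than a percentage, which is an assumption, and `y` must not contain zeros for MAPE to be defined.

```python
from math import sqrt

def reg_metrics(y, yhat):
    """Return RMSE, MSE, MAE and MAPE (as a fraction) for observed y and predicted yhat."""
    resid = [a - b for a, b in zip(y, yhat)]
    n = len(resid)
    mse = sum(r * r for r in resid) / n
    mae = sum(abs(r) for r in resid) / n
    mape = sum(abs(r / a) for r, a in zip(resid, y)) / n
    return {"RMSE": sqrt(mse), "MSE": mse, "MAE": mae, "MAPE": mape}

metrics = reg_metrics(y=[2, 4, 5], yhat=[1, 4, 7])
```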

Venn Diagram

bioinfokit.visuz.venn(vennset, venncolor, vennalpha, vennlabel)

Parameters Description
vennset Venn dataset for 3 and 2-way venn. Data should be in the format of (100,010,110,001,101,011,111) for 3-way venn and 2-way venn (10, 01, 11) [default: (1,1,1,1,1,1,1)]
venncolor Color Palette for Venn [color code][default: ('#00909e', '#f67280', '#ff971d')]
vennalpha Transparency of Venn [float (0 to 1)][default: 0.5]
vennlabel Labels to Venn [string][default: ('A', 'B', 'C')]

Returns:

Venn plot (venn3.png, venn2.png)

Working example
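The `vennset` tuple holds region sizes in the order (100, 010, 110, 001, 101, 011, 111). One way to derive it from three Python sets is sketched below; `venn3_counts` is a hypothetical helper and the sets are toy data.

```python
def venn3_counts(a, b, c):
    """Region sizes for a 3-way Venn in the order 100, 010, 110, 001, 101, 011, 111."""
    return (
        len(a - b - c),    # 100: only in A
        len(b - a - c),    # 010: only in B
        len((a & b) - c),  # 110: in A and B, not C
        len(c - a - b),    # 001: only in C
        len((a & c) - b),  # 101: in A and C, not B
        len((b & c) - a),  # 011: in B and C, not A
        len(a & b & c),    # 111: in all three
    )

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {4, 6, 7}
vennset = venn3_counts(A, B, C)
```

The resulting tuple can be passed as the `vennset` argument described above.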

References:

  • Travis E. Oliphant. A guide to NumPy, USA: Trelgol Publishing, (2006).
  • John D. Hunter. Matplotlib: A 2D Graphics Environment, Computing in Science & Engineering, 9, 90-95 (2007), DOI:10.1109/MCSE.2007.55
  • Fernando Pérez and Brian E. Granger. IPython: A System for Interactive Scientific Computing, Computing in Science & Engineering, 9, 21-29 (2007), DOI:10.1109/MCSE.2007.53
  • Michael Waskom, Olga Botvinnik, Joel Ostblom, Saulius Lukauskas, Paul Hobson, MaozGelbart, … Constantine Evans. (2020, January 24). mwaskom/seaborn: v0.10.0 (January 2020) (Version v0.10.0). Zenodo. http://doi.org/10.5281/zenodo.3629446
  • Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825-2830 (2011)
  • Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010)
  • Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. (2020) SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17(3), 261-272.
  • David C. Howell. Multiple Comparisons With Unequal Sample Sizes. https://www.uvm.edu/~statdhtx/StatPages/MultipleComparisons/unequal_ns_and_mult_comp.html

Last updated: November 20, 2021

bioinfokit's People

Contributors

jilpulvino, reneshbedre


bioinfokit's Issues

from bioinfokit.analys import stat

I keep getting this error message when i try to run the line: from bioinfokit.analys import stat
Traceback (most recent call last):
  File "C:\Users\victo\PycharmProjects\Python\Math Modeling\0728 Homework\Tukey HSD Test.py", line 1, in <module>
    from bioinfokit.analys import stat
  File "C:\Users\victo\PycharmProjects\bioinfokit\bioinfokit\analys.py", line 15, in <module>
    from bioinfokit.visuz import general
  File "C:\Users\victo\PycharmProjects\bioinfokit\bioinfokit\visuz.py", line 12, in <module>
    from matplotlib_venn import venn3, venn2
ModuleNotFoundError: No module named 'matplotlib_venn'

Volcano plot - AssertionError: either significant or non-significant genes are missing; try to change lfc_thr or pv_thr to include both significant and non-significant genes

Hi,

The library is installed correctly and the command works with no problem on the example data,
but when I am running the library on my data I get the following error:
>>>visuz.gene_exp.volcano(df=df, lfc='Fold_change', pv='p_value', show=True)
AssertionError: either significant or non-significant genes are missing; try to change lfc_thr or pv_thr to include both significant and non-significant genes
I tried playing with the thresholds and excluding columns/rows with NaN values. Nothing helped.

Any advice?

coloring points in PCA plots

Hello-
First, this is a great tool! Thanks so much for the documentation. Like jwill490 who commented Nov 2020, I too am interested in coloring points in the PCA plots by a metadata attribute. I have 24 samples, 8 observations. I appreciate any assistance you could offer.
Thanks again!
Melody
DEG_hg38.csv

Is it possible to create a volcano plot without significant genes?

Hi!
I have read through #22, #31 and #39 which are similar, as they show the error AssertionError: either significant or non-significant genes are missing; try to change lfc_thr or pv_thr to include both significant and non-significant genes, but I can't find a way to plot a volcano plot when there are no significant genes. I am using these plots as an exploratory analysis of several datasets and it would be very useful for me to see cases where there are no significant genes.

Thanks!

is it possible to save the plot in another directory?

I didn't find a parameter to save it in another location

Another problem: if I have no down-regulated genes, I always get an error message telling me to change the cutoff. Is it possible to just leave either the up- or down-regulated group empty?

Manhattan plot with fst

Hello developers,

I would like to know if its advisable to use bioinfokit to make manhattan plot on my data which has chromosome(in the format Pf3D7_{1-13}_v3), position and fst values.

Thanks

Manhattan plot y axis tick labels

Hi!

Thank you for this amazing suite of tools. It has made making a Manhattan plot a breeze!

I am having some trouble getting my Y-axis tick labels to look right, and I thought maybe you would have some advice. First, I did not specify the ylm argument and looked at the default labeling. There is a lot of number overlap, so then I tried setting the axis limits and intervals with the ylm argument to ylm = (0, 35, 3). It still isn't quite right and I would love your input.

All the best,
Sabrina

Pip package 0.9.5 doesn't include all necessary modules (missing adjustText module)

Installing 0.9.5 doesn't include the adjustText module by default.

Example:

conda create -n bioinfokit-test python=3.7 -y
conda activate bioinfokit-test
pip install bioinfokit==0.9.5
python -c 'from bioinfokit import analys'
conda deactivate
conda env remove -n bioinfokit-test

gives the following error at the import step:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/cmcleod/conda/envs/bioinfokit-test/lib/python3.7/site-packages/bioinfokit/analys.py", line 6, in <module>
    from bioinfokit.visuz import screeplot, pcaplot, general
  File "/Users/cmcleod/conda/envs/bioinfokit-test/lib/python3.7/site-packages/bioinfokit/visuz.py", line 12, in <module>
    from adjustText import adjust_text
ModuleNotFoundError: No module named 'adjustText'

module 'bioinfokit.visuz' has no attribute 'GeneExpression'

Hi.

Thanks for sharing nice scripts. Recently I wanted to make volcano plot.
However, I got this error: "module 'bioinfokit.visuz' has no attribute 'GeneExpression'"

I updated the bioinfokit. but no success.

Could you let me know what's happening? I am using python (ver 3.7 or 3.8) on Windows 10 home version.

update: I just installed it in windows command line window by : pip install bioinfokit
It installed v 0.6.xxxx
updated it pip --upgrade bioinfokit
Here is the error screen below

=======================================================================
Requirement already satisfied: bioinfokit in c:\programdata\anaconda3\lib\site-packages (0.6)
Collecting bioinfokit
Using cached bioinfokit-2.0.8.tar.gz (84 kB)
ERROR: Command errored out with exit status 1:
command: 'C:\ProgramData\Anaconda3\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\xxx\AppData\Local\Temp\pip-install-421w2xln\bioinfokit_579b051cf380446c8f48ff558d85d896\setup.py'"'"'; file='"'"'C:\Users\xxx\AppData\Local\Temp\pip-install-421w2xln\bioinfokit_579b051cf380446c8f48ff558d85d896\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\xxx\AppData\Local\Temp\pip-pip-egg-info-xg3tboln'
cwd: C:\Users\xxx\AppData\Local\Temp\pip-install-421w2xln\bioinfokit_579b051cf380446c8f48ff558d85d896
Complete output (5 lines):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\xxx\AppData\Local\Temp\pip-install-421w2xln\bioinfokit_579b051cf380446c8f48ff558d85d896\setup.py", line 5, in <module>
    long_description = fh.read()
UnicodeDecodeError: 'cp949' codec can't decode byte 0xe2 in position 62525: illegal multibyte sequence
----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/8f/a5/e23716d25cb293873f64a372ae93e3d013da726838792f5d670b95562021/bioinfokit-2.0.8.tar.gz#sha256=6155ef2566b76b731c10d09ba2828941c619bc8a0a838ed21cd58d68cef90977 (from https://pypi.org/simple/bioinfokit/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Here is the code I used below.

=================================

# import packages
import pandas as pd
from bioinfokit import analys, visuz

# import the DGE table (condition_infected_vs_control_dge.csv)
df = pd.read_csv("xxxxxxx/xxx_dge.csv")

# drop NA values
df=df.dropna()

# create volcano plot
visuz.GeneExpression.volcano(df=df, lfc='log2FoldChange', pv='padj', sign_line=True, plotlegend=True)

Volcano plot title

Could you please add a title argument to the volcano plot function? I see that the ma plot function already has a title argument, so I assume that it would be possible to do the same for the volcano plot. Thank you.

genenames='deg' don't work in volcano plot script

Hi,
very interesting project. I am trying to do the tutorial of the webpage, but when I try to do a volcano plot with labeling all DEGs it give me several errors. My code is:

visuz.gene_exp.volcano(df=df, lfc='log2FC', pv='p-value', lfc_thr=(1, 2), pv_thr=(0.05, 0.01), genenames='deg')

and the errors are

Traceback (most recent call last):
  File "/Users/arturo/miniconda3/envs/rnaseq/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: None

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/arturo/miniconda3/envs/rnaseq/lib/python3.6/site-packages/bioinfokit/visuz.py", line 422, in volcano
    gene_exp.geneplot(df, geneid, lfc, lfc_thr, pv_thr, genenames, gfont, pv, gstyle)
  File "/Users/arturo/miniconda3/envs/rnaseq/lib/python3.6/site-packages/bioinfokit/visuz.py", line 337, in geneplot
    for i in d[geneid].unique():
  File "/Users/arturo/miniconda3/envs/rnaseq/lib/python3.6/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/arturo/miniconda3/envs/rnaseq/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: None

Can you help me?

Thanks

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 63494: illegal multibyte sequence

Collecting bioinfokit
Using cached https://files.pythonhosted.org/packages/e1/32/e1581d40b9e1f88c496b854002ab00f45c47880909f8a529ced2e7ea7314/bioinfokit-2.0.6.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\user\AppData\Local\Temp\pycharm-packaging\bioinfokit\setup.py", line 5, in <module>
long_description = fh.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 63494: illegal multibyte sequence

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in C:\Users\user\AppData\Local\Temp\pycharm-packaging\bioinfokit\

Environment (please complete the following information):

OS: windows10
Python version: 3.7
bioinfokit version: 2.0.6
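
The traceback stops at `long_description = fh.read()` in setup.py, which suggests (this is an assumption, not confirmed in the report) that the README is opened without an explicit encoding, so Windows falls back to the locale codec, here gbk. The sketch below reproduces the idea with a hypothetical README file and shows that passing `encoding="utf-8"` to `open()` reads the text correctly regardless of the system locale.

```python
import os
import tempfile

# Hypothetical stand-in for a README that contains non-ASCII characters.
readme_text = "bioinfokit \u03b1-diversity \u2265 0.05\n"

with tempfile.TemporaryDirectory() as tmp:
    readme_path = os.path.join(tmp, "README.md")
    with open(readme_path, "w", encoding="utf-8") as fh:
        fh.write(readme_text)

    # Reading UTF-8 bytes with a mismatched codec either fails outright
    # or silently produces the wrong characters.
    try:
        with open(readme_path, "r", encoding="gbk") as fh:
            gbk_result = fh.read()
    except UnicodeDecodeError:
        gbk_result = None

    # Stating the encoding explicitly makes the read locale-independent.
    with open(readme_path, "r", encoding="utf-8") as fh:
        long_description = fh.read()

print(long_description == readme_text)
```

Until a fix like this is released, installing from a wheel or from conda, where setup.py is not executed, may sidestep the problem.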

Normalization on TCGA dataset

Hello bioinfokit team, I have downloaded raw HTSeq gene expression data from TCGA and would like to perform FPKM normalization. According to the documentation, the commands require gene lengths, which I don't have. Is there a way to normalize without supplying the gene length? Thanks
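
FPKM, RPKM and TPM are length-normalized by definition, so they cannot be computed without gene lengths. CPM, by contrast, only scales each sample by its library size. The following is a minimal pandas sketch of CPM on made-up counts; it is not bioinfokit's implementation, and the gene and sample names are hypothetical.

```python
import pandas as pd

# Hypothetical raw counts: rows are genes, columns are samples.
counts = pd.DataFrame(
    {"sample_A": [10, 20, 70], "sample_B": [5, 15, 30]},
    index=["gene1", "gene2", "gene3"],
)

# CPM = count / total counts in that sample * 1e6 (no gene length needed).
library_sizes = counts.sum(axis=0)
cpm = counts.div(library_sizes, axis=1) * 1e6

print(cpm)
```

If FPKM is really required, gene lengths have to come from the annotation (GTF/GFF) that was used for counting.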

Coloring PCA plot points by metadata

Hi there,

Is there any way to color the points in the PCA plots by a metadata attribute without forking the project? For example, I have 4 sample types (fetal12w, fetal13w, adult12w, adult13w) across 29 samples. I'd like the samples to be color-coded by these groups if at all possible.

Thanks!
John

What is the df in the Normalization part

Hello Dr. Bedre,
Thank you for developing this tool for bioinformatics analysis; it is very useful.
When I calculate CPM, RPKM, or TPM, one thing confuses me. You mention that df is a pandas DataFrame containing raw gene expression values. How should I understand these expression values? Are they mapped read counts or some other number? If they are some other number, how do you generate it?
Thanks

Axis limits cannot be NaN or Inf

Greetings Renesh-
I echo other users when I say that this is a great tool! Thank you very much. I am attempting to generate a Manhattan plot, but am met with the message "Axis limits cannot be NaN or Inf". My smallest p-value is 4.9E-323. Might you offer some advice on how to overcome the error, please?
Thank you-
Melody
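
A value of 4.9E-323 is still representable as a float, and its -log10 is finite (roughly 322). One possible cause, and this is only an assumption, is a p-value elsewhere in the column that underflowed to exactly 0, for which -log10 is not finite. The sketch below floors p-values at a small positive number before the transform; the helper name `neg_log10_p` and the chosen floor are hypothetical.

```python
import math

def neg_log10_p(p, floor=5e-324):
    """Return -log10(p) after clamping p to a small positive floor.

    The default floor is the smallest positive double, so a p-value of
    exactly 0 no longer yields an infinite or undefined result.
    """
    if math.isnan(p):
        raise ValueError("p-value is NaN")
    return -math.log10(max(p, floor))

p_values = [0.05, 4.9e-323, 0.0]
scores = [neg_log10_p(p) for p in p_values]

print(scores)
```

Applying the same clamp to the p-value column before plotting keeps the axis limits finite; missing values still need to be dropped separately.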

Labeling most significant/highest log fold change?

I was wondering if it is possible to label the values that are above a certain p-value and log fold change threshold within the genenames parameter, instead of having to change the threshold within lfc_thr and pv_thr? For example, by using indexes for tuples?
Thanks!

marker size and shape in bioinfokit.visuz.cluster.pcaplot

Hello,

I am using cluster.pcaplot to plot 2D and 3D plots for a set of data. Is there any option for marker size and marker shape that I can use?
Currently the marker labels overlap and are not distinguishable.

Thanks,
Amir

cluster.pcaplot(x=loadings[0], y=loadings[1],show=True,plotlabels=True,axlabelfontsize=20,labels=df.columns.values,
var1=round(pca_out.explained_variance_ratio_[0]*100, 2),
var2=round(pca_out.explained_variance_ratio_[1]*100, 2))

Bug when plotting marker labels referenced by tuple

Hey,
thanks for this nice library. I believe there's a small bug in the Manhattan plot when passing the marker IDs as a tuple; I think the line
for i in df[markeridcol].unique():
should rather be
for i in markernames:

And maybe it would also make sense to extend the previous line (278) to
elif markernames is not None and isinstance(markernames, (tuple, list)):
so that one can also pass lists instead of tuples.

Best,
Matthias
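
The suggestion above can be sketched in isolation as follows. The helper `markers_to_label` is hypothetical and only illustrates the two points of the report: accept both tuples and lists, and iterate over the requested marker names rather than over every unique marker in the data.

```python
def markers_to_label(all_marker_ids, markernames):
    """Hypothetical helper: pick the marker IDs that should get a text label."""
    if markernames is None:
        return []
    if isinstance(markernames, (tuple, list)):
        available = set(all_marker_ids)
        # Iterate over the requested names, not over every marker in the data.
        return [name for name in markernames if name in available]
    raise TypeError("markernames must be a tuple or list in this sketch")

all_ids = ["rs1", "rs2", "rs3", "rs4"]
print(markers_to_label(all_ids, ("rs2", "rs4")))  # tuple input
print(markers_to_label(all_ids, ["rs2", "rs4"]))  # list input gives the same result
```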

Can't install bioinfokit

executed command:
pip install --index-url https://pypi.tuna.tsinghua.edu.cn/simple/ bioinfokit

output:
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple/

WARNING: Ignoring invalid distribution -illow (f:\python3.8\lib\site-packages)
WARNING: Ignoring invalid distribution -illow (f:\python3.8\lib\site-packages)
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory'))': /simple/bioinfokit/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory'))': /simple/bioinfokit/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory'))': /simple/bioinfokit/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory'))': /simple/bioinfokit/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory'))': /simple/bioinfokit/
ERROR: Could not find a version that satisfies the requirement bioinfokit (from versions: none)
ERROR: No matching distribution found for bioinfokit
WARNING: Ignoring invalid distribution -illow (f:\python3.8\lib\site-packages)
WARNING: Ignoring invalid distribution -illow (f:\python3.8\lib\site-packages)

executed command:
easy_install bioinfokit

output:
WARNING: The easy_install command is deprecated and will be removed in a future version.
Searching for bioinfokit
Reading https://pypi.org/simple/bioinfokit/
Downloading https://files.pythonhosted.org/packages/e1/32/e1581d40b9e1f88c496b854002ab00f45c47880909f8a529ced2e7ea7314/bioinfokit-2.0.6.tar.gz#sha256=5552eecd98f3ad90bc39939bfbb50a933c7b0a4159360bc386555795c035e8e3
Best match: bioinfokit 2.0.6
Processing bioinfokit-2.0.6.tar.gz
Writing C:\Users\MSI\AppData\Local\Temp\easy_install-sjpi64h8\bioinfokit-2.0.6\setup.cfg
Running bioinfokit-2.0.6\setup.py -q bdist_egg --dist-dir C:\Users\MSI\AppData\Local\Temp\easy_install-sjpi64h8\bioinfokit-2.0.6\egg-dist-tmp-gju5ypax
Traceback (most recent call last):
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 154, in save_modules
yield saved
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 195, in setup_context
yield
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 250, in run_setup
_execfile(setup_script, ns)
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 45, in _execfile
exec(code, globals, locals)
File "C:\Users\MSI\AppData\Local\Temp\easy_install-sjpi64h8\bioinfokit-2.0.6\setup.py", line 5, in
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 63494: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "f:\python3.8\lib\runpy.py", line 194, in _run_module_as_main
return run_code(code, main_globals, None,
File "f:\python3.8\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "F:\python3.8\Scripts\easy_install.exe\__main__.py", line 7, in <module>
File "f:\python3.8\lib\site-packages\setuptools\command\easy_install.py", line 2307, in main
setup(
File "f:\python3.8\lib\site-packages\setuptools\__init__.py", line 165, in setup
return distutils.core.setup(**attrs)
File "f:\python3.8\lib\distutils\core.py", line 148, in setup
dist.run_commands()
File "f:\python3.8\lib\distutils\dist.py", line 966, in run_commands
self.run_command(cmd)
File "f:\python3.8\lib\distutils\dist.py", line 985, in run_command
cmd_obj.run()
File "f:\python3.8\lib\site-packages\setuptools\command\easy_install.py", line 425, in run
self.easy_install(spec, not self.no_deps)
File "f:\python3.8\lib\site-packages\setuptools\command\easy_install.py", line 686, in easy_install
return self.install_item(spec, dist.location, tmpdir, deps)
File "f:\python3.8\lib\site-packages\setuptools\command\easy_install.py", line 712, in install_item
dists = self.install_eggs(spec, download, tmpdir)
File "f:\python3.8\lib\site-packages\setuptools\command\easy_install.py", line 897, in install_eggs
return self.build_and_install(setup_script, setup_base)
File "f:\python3.8\lib\site-packages\setuptools\command\easy_install.py", line 1167, in build_and_install
self.run_setup(setup_script, setup_base, args)
File "f:\python3.8\lib\site-packages\setuptools\command\easy_install.py", line 1151, in run_setup
run_setup(setup_script, args)
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 253, in run_setup
raise
File "f:\python3.8\lib\contextlib.py", line 131, in exit
self.gen.throw(type, value, traceback)
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 195, in setup_context
yield
File "f:\python3.8\lib\contextlib.py", line 131, in exit
self.gen.throw(type, value, traceback)
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 166, in save_modules
saved_exc.resume()
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 141, in resume
six.reraise(type, exc, self._tb)
File "f:\python3.8\lib\site-packages\setuptools_vendor\six.py", line 685, in reraise
raise value.with_traceback(tb)
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 154, in save_modules
yield saved
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 195, in setup_context
yield
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 250, in run_setup
_execfile(setup_script, ns)
File "f:\python3.8\lib\site-packages\setuptools\sandbox.py", line 45, in _execfile
exec(code, globals, locals)
File "C:\Users\MSI\AppData\Local\Temp\easy_install-sjpi64h8\bioinfokit-2.0.6\setup.py", line 5, in
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 63494: illegal multibyte sequence

OS: Windows10
Python version: 3.8

Slightly confusing error message if all are significant

This is a very unlikely scenario I know, I just thought it could be useful to let you know.
On line 156, the error message says "either significant up or down genes are missing; try to lower lfc_thr or increase pv_thr" if the number of colours is not three. However, if for some reason you have, as I did, a test file with only significant values, this message is very confusing. I know it's not close to real data, but I just thought you should know.

Multiple plot windows

Hi @reneshbedre, I was trying to use bioinfokit to generate heatmaps for gene expression data following your example here. Whenever I use show=True, multiple windows are opened, with just one of them displaying the heatmap. Please advise.

Possible typos - color_result in visuz.py

When I run the function biplot(), I get an error message: "name 'color_result' is not defined". After looking through visuz.py, I found the following code (lines 1594-1605):

if colordot and isinstance(colordot, (tuple, list)):
    colour_map = ListedColormap(colordot)
    # for i in range(len(list(unique_class))):
    #    color_dict[list(unique_class)[i]] = colordot[i]
    # color_result = [color_dict[i] for i in colorlist]
    s = plt.scatter(cscore[:, 0] * xscale, cscore[:, 1] * yscale, c=color_result_num, cmap=colour_map,
                    s=dotsize, alpha=valphadot, marker=markerdot)
    plt.legend(handles=s.legend_elements()[0], labels=list(unique_class), loc=legendpos)
elif colordot and not isinstance(colordot, (tuple, list)):
    # s = plt.scatter(cscore[:, 0] * xscale, cscore[:, 1] * yscale, color=color_result, s=dotsize,
    #                alpha=valphadot, marker=markerdot)
    # plt.legend(handles=s.legend_elements()[0], labels=list(unique_class))
    s = plt.scatter(cscore[:, 0] * xscale, cscore[:, 1] * yscale, c=color_result, s=dotsize,
                    alpha=valphadot, marker=markerdot)
    plt.legend(handles=s.legend_elements()[0], labels=list(unique_class), loc=legendpos)

As far as I can tell, the line
s = plt.scatter(cscore[:, 0] * xscale, cscore[:, 1] * yscale, c=color_result, s=dotsize,
alpha=valphadot, marker=markerdot)
should be:
s = plt.scatter(cscore[:, 0] * xscale, cscore[:, 1] * yscale, c=color_result_num, s=dotsize,
alpha=valphadot, marker=markerdot)

What are your thoughts?

Error with vcf file split

Hi
VCF split can fail with an error:
UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U4')) -> None

In file analys.py

455 sub_df = read_vcf_file_df[read_vcf_file_df[id]==chrom_ids[r]]
456 # out_vcf_file = open(chrom_ids[r]+'.vcf'
457 with open(chrom_ids[r]+'.vcf', 'w') as out_vcf_file:
458 for l in info_lines:
459 out_vcf_file.write(l+'\n')

I've split vcf file with chromosomes named: 1, 2, 3 etc., and found this error. Please check it.

I think an explicit str() conversion will help avoid this kind of problem in the future.
I suggest changing lines 457 and 460 to this:
457 with open(str(chrom_ids[r])+'.vcf', 'w') as out_vcf_file:
...
460 sub_df.to_csv(str(chrom_ids[r])+'.vcf', mode='a', sep='\t', index=False)
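
The failure mode can be reproduced without bioinfokit. When chromosome IDs are parsed as integers, concatenating one with '.vcf' is an int-plus-str operation; plain Python integers raise a TypeError for it, while NumPy integers raise the UFuncTypeError shown above. The sketch uses hypothetical chromosome IDs and shows that wrapping the ID in str() works for both numeric and named chromosomes.

```python
# Hypothetical chromosome IDs as they might come out of a parsed VCF:
# numeric names end up as integers, named chromosomes stay strings.
chrom_ids = [1, 2, "X"]

# Without conversion, int + str is a type error.
try:
    bad_name = chrom_ids[0] + ".vcf"
except TypeError as err:
    bad_name = None
    print("concatenation failed:", err)

# Converting explicitly handles every ID type the same way.
file_names = [str(chrom) + ".vcf" for chrom in chrom_ids]
print(file_names)
```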

No images

Dear all,
When I draw a volcano plot using visuz.volcano(), no image or error message appears.
Does anyone know what the problem is?
Thank you!

image

Consider uploading to bioconda

Really great library! Thanks for making it—it's saved me quite a bit of time.

I'm using this to perform tpm conversion in a Jupyter notebook. My entire environment can be installed with a conda install with the exception of this tool. Please consider adding it as a bioconda recipe.

NameError: name 'color_result' is not defined

Error when trying to create PCA biplot with target label. Following this tutorial:
https://www.reneshbedre.com/blog/principal-component-analysis.html

pca_scores = PCA().fit_transform(X_st)
cluster.biplot(cscore=pca_scores, loadings=loadings, labels=X.columns.values, var1=round(pca_out.explained_variance_ratio_[0]*100, 2),
var2=round(pca_out.explained_variance_ratio_[1]*100, 2), colorlist=target)


NameError Traceback (most recent call last)
in
7 pca_scores = PCA().fit_transform(X_st)
8 cluster.biplot(cscore=pca_scores, loadings=loadings, labels=X.columns.values, var1=round(pca_out.explained_variance_ratio_[0]*100, 2),
----> 9 var2=round(pca_out.explained_variance_ratio_[1]*100, 2), colorlist=target)

~/opt/anaconda3/envs/analysis/lib/python3.6/site-packages/bioinfokit/visuz.py in biplot(cscore, loadings, labels, var1, var2, var3, axlabelfontsize, axlabelfontname, figtype, r, show, markerdot, dotsize, valphadot, colordot, arrowcolor, valphaarrow, arrowlinestyle, arrowlinewidth, centerlines, colorlist, legendpos, datapoints, dim)
1604
1605 # only if y axis is positive
-> 1606 if pv:
1607 if y_pos > 0:
1608 pv_symb = general.pvalue_symbol(pv[i], sign_symbol_opts['symbol'])

NameError: name 'color_result' is not defined

AssertionError: either significant or non-significant genes are missing; try to change lfc_thr or pv_thr to include both significant and non-significant genes

Hi,

I am trying to make a volcano plot labelling all the DEGs with the following code:

from bioinfokit import analys, visuz
import pandas as pd

csv_list = ["Def_results_Day2_mock_vs_Day2_rdORF6", \
"Def_results_Day2_mock_vs_Day2_rWT", \
"Def_results_Day2_mock_vs_Day4_rdORF6", \
"Def_results_Day2_mock_vs_Day4_rWT", \
"Def_results_Day2_mock_vs_Day6_rdORF6", \
"Def_results_Day2_mock_vs_Day6_rWT", \
"Def_results_Day2_rWT_vs_Day2_rdORF6", \
"Def_results_Day4_rWT_vs_Day4_rdORF6", \
"Def_results_Day6_rWT_vs_Day6_rdORF6"]


for x in csv_list:

    folder = "../deseq2/"
    path = folder + x + ".csv"

    df = pd.read_csv(path)

    visuz.gene_exp.volcano(df=df, geneid='gene_name', lfc='log2FoldChange', lfc_thr=(-6, 6), pv_thr=(1, 0), genenames='deg', pv='PValue', figname=x)

but it gives me the following error:

Traceback (most recent call last):
  File "volcanoplot_script_v0.3.py", line 40, in <module>
    visuz.gene_exp.volcano(df=df, geneid='gene_name', lfc='log2FoldChange', lfc_thr=(-6, 6), pv_thr=(1, 0), genenames='deg', pv='PValue', figname=x)
  File "/Users/arturo/miniconda3/envs/rnaseq/lib/python3.6/site-packages/bioinfokit/visuz.py", line 405, in volcano
    'either significant or non-significant genes are missing; try to change lfc_thr or pv_thr to include ' \
AssertionError: either significant or non-significant genes are missing; try to change lfc_thr or pv_thr to include both significant and non-significant genes

I ran some tests with other values of lfc_thr and pv_thr, but it gives me the same error. I have attached an example of the volcano plot on which I want to put the labels of the DEGs. Can you help me?
Def_results_Day2_mock_vs_Day2_rdORF6

error: dataframe contains non-numeric values in pv column

Hi.

Some of the p-values in my data have the form 'xxxxxx e-10' (my CSV file contains 'xxxxxx e-10' in the p-value column), and the volcano plot cannot handle this exponential format.
I tried to convert the column to float by using float() on the dataframe, but it shows the same error.

Could you suggest any workaround for this?
Thanks,
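
One possible workaround, assuming the problem is the space inside values such as '1.5 e-10' (which makes pandas read the column as text), is to strip the whitespace and convert with `pandas.to_numeric` before plotting. The column name and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical p-value column read from CSV as strings.
df = pd.DataFrame({"pvalue": ["1.5 e-10", "0.03", "2 e-5", "NA"]})

# Remove embedded whitespace so '1.5 e-10' becomes '1.5e-10'.
cleaned = df["pvalue"].astype(str).str.replace(r"\s+", "", regex=True)

# Convert to float; anything that still cannot be parsed becomes NaN.
df["pvalue"] = pd.to_numeric(cleaned, errors="coerce")

# Drop rows without a usable p-value.
df = df.dropna(subset=["pvalue"])

print(df.dtypes)
print(df)
```

After this step the column has a float dtype, which is what a numeric check on the pv column would expect.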

TypeError: hmap() got an unexpected keyword argument 'rowclus'


TypeError Traceback (most recent call last)
in ()
2 df = df.set_index(df.columns[0])
3 df.head()
----> 4 visuz.gene_exp.hmap(df=np.log2(df), cmap='RdYlBu', dim=(5, 10), tickfont=(8, 4),clus=False,rowclus=True)

TypeError: hmap() got an unexpected keyword argument 'rowclus'

VCF annotations should either be added to the INFO field or the output should be a tab-separated document

Almost every VCF annotation tool out there adds annotations to either the INFO field or outputs a tab-delimited file (or does both). Adding new non-sample columns to a VCF is not annotation, and it breaks VCF specification, which states that all but the first 8 fixed fields must have genotype information per sample.

Please write your annotations as a tab-delimited output, or add them to the INFO field. Otherwise, the VCF is not usable downstream.

assertion error for non-significant genes in lfc

Is it possible to turn off the assertion check for the presence of non-significant genes? I have some datasets in which I do not see any down-regulated genes at all, and I can't create a volcano plot because of this error:
AssertionError: either significant or non-significant genes are missing; try to change lfc_thr or pv_thr to include both significant and non-significant genes

issue importing analys & visuz

Hi, I am facing the following two errors while using bioinfokit:

  1. While using it via the Linux command line, with all dependencies installed, I still get the following error:
    Python 3.8.2 (default, Apr 17 2020, 12:53:23)
    [GCC 7.5.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.

>>> import bioinfokit
>>> from bioinfokit import analys, visuz
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'analys' from 'bioinfokit' (unknown location)
>>> from bioinfokit import analys
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'analys' from 'bioinfokit' (unknown location)
>>> from bioinfokit import visuz
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'visuz' from 'bioinfokit' (unknown location)

2. While using bioinfokit via Jupyter notebook, I am unable to generate the plots.

Any help regarding the possible reasons for these issues would be very helpful.

How to properly adjust figure size and y axis label size

Hello Dr. Bedre,
First, thank you very much for the simple and useful heat map code.
I am new to this field and trying to find my way. Your source has been very helpful to me.
Here is the heatmap I made from FPKM values from cuffdiff. I then computed zFPKM values and used this matrix for the heatmap. I changed dim and tickfont to get a better figure but still could not get one.
My question is: do I need to show fewer genes on the y-axis, or otherwise, how do I make the labels legible?
I have 229 genes in the list, and if I can show all of them as labels, that would be great.
heatmap

Gene expression correlation analysis question

I'm testing out a simple proof-of-concept where I want to use bioinfokit's stat.corr_mat to calculate gene correlations from raw expression counts. My dataframe has ~11K columns each representing a gene, and there are ~11K rows each representing a patient with raw RNA-seq read count values for every gene. With the dataframe setup this way, could I expect to see correlation values between genes across patient samples?

I understand that I may need to normalize the data for such cross sample comparisons, but I wanted to make sure that I understood the basic operation first. This example differs from the worked example in that I would like to visualize gene-gene expression correlations and not fold change-treatment correlations. Is this an appropriate use of corr_mat? If not, is there another function in bioinfokit that may do what I'm trying to visualize?

Thank you in advance!
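
For reference, pandas' `DataFrame.corr` computes pairwise correlations between columns, so with genes as columns and patients as rows the result is a gene-by-gene matrix. Whether `stat.corr_mat` behaves the same way is not confirmed here; the sketch below only illustrates the layout with made-up values and hypothetical gene names.

```python
import pandas as pd

# Hypothetical expression table: rows are patients, columns are genes.
expr = pd.DataFrame(
    {
        "geneA": [1, 2, 3, 4],
        "geneB": [2, 4, 6, 8],  # rises with geneA
        "geneC": [4, 3, 2, 1],  # falls as geneA rises
    },
    index=["patient1", "patient2", "patient3", "patient4"],
)

# Pairwise Pearson correlation between columns (genes) across rows (patients).
gene_corr = expr.corr(method="pearson")

print(gene_corr)
```

With about 11K genes the full matrix has roughly 121 million entries (close to 1 GB as float64), so restricting the analysis to a subset of genes may be necessary before plotting.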

error with saving figure with figname

Hi Renesh,

Thanks for building this package.

I'm struggling with altering the file name using the parameter figname.

image

Please advise.

Thanks,
Justine

FASTQ batch downloads from SRA database not working for Windows

Hi,

It seems that the command "bioinfokit.analys.fastq.sra_bd" doesn't work for Windows. The issue seems to be the absence of fasterq function in the Windows version of the SRA toolkit.

Is there any other way to use this function or be able to download large batches of data from the SRA database in a single command?

Regards.

The VCF combine chromosomes operation should be called "concat", not "merge"

It is common convention across popular tools such as the VCFtools PERL library and bcftools to differentiate between

  1. combining chromosome-specific VCF files into a single VCF file; and
  2. combining single sample VCF files into one multi-sample VCF file

The former is called concat and the latter, merge. However, in your toolkit, you call the former merge, which is misleading. Plus, given that your README does not describe what the operation actually does, one needs to dig deep to understand what's going on.

Please rename the operation and add a line in the README addressing this.

Manhattan Plot Highlighting

Hi,

Thanks for making this! We need more python tools for analyzing genomic data.

Is it possible to highlight specific points in a Manhattan plot in different colors (e.g. SNPs occurring in an exon are blue, SNPs in an intron are green, and so on)?

read csv

How can I read my own CSV file?

Not so many modules in the latest installed version of bioinfokit

Hi,
Bioinfokit is a really good tool for handling common next-generation sequencing tasks. It seems to be a bug that there are not many modules in the latest installed version, or is it a problem on my side?

>>> import bioinfokit

>>> bioinfokit.__version__
'0.9.8'
>>> dir(bioinfokit)
['__author__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'name']
>>> dir(bioinfokit.name)
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

Error in volcano plot

I am trying to use the volcano plot (visuz.GeneExpression.volcano), but I always get this error:

AssertionError: dataframe contains non-numeric values in pv column.

However, all data in pv column is numeric

Offsetting text and editing axis titles in biplots?

I'm currently using your (great!) tool to produce PCA biplots with the loadings shown, but I'm having a couple of minor issues.

Is there any way to edit the text on the axes? I want to also plot PC1 against PC3 etc and can't find a way to change the wording from PC2 (although the explained variance is picked up).

I also have quite a number of variables, so the labels on the loading vectors currently overlap and are illegible. Is there a way to jitter or offset the labels?

Thanks for your help, this is a very useful tool.

Question regarding importing files

Hi Renesh,

Thank you for the awesome tool.
I have a quick question regarding importing my own data.
I am using version 1.0.5.
To import a CSV file, I tried df = analys.get_data(Day2.csv), but it didn't work. I also tried df = pd.read_csv(Day2.csv), but then I got a NameError.
It would be great if you could help me with this. I would also appreciate it if you could add import instructions to the documentation.
Thank you
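
A likely cause of the NameError, judging only from the calls quoted above, is that the file name was passed without quotes, so Python looks for a variable called Day2 instead of a file. A minimal sketch with `pandas.read_csv` follows; the file contents and column names are invented for the example.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create a small hypothetical CSV file to read back.
    csv_path = os.path.join(tmp, "Day2.csv")
    with open(csv_path, "w", encoding="utf-8") as fh:
        fh.write("gene,log2FC,p-value\ng1,1.2,0.01\ng2,-0.8,0.20\n")

    # The path is a string, so it must be quoted (or held in a variable).
    df = pd.read_csv(csv_path)

print(df.head())
```

For a file in the current working directory, the equivalent call is pd.read_csv("Day2.csv").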
