oheil / normalizequantiles.jl

NormalizeQuantiles.jl implements quantile normalization

License: MIT License

Julia 100.00%
quantile-normalization normalize-quantiles quantile-normalisation normalise-quantiles statistics microarray-data-analysis bioinformatics

normalizequantiles.jl's Introduction

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

NormalizeQuantiles

For Julia 0.4, 0.5 and 0.6 see: https://github.com/oheil/NormalizeQuantiles.jl/tree/backport-0.6

Package NormalizeQuantiles implements quantile normalization

qn = normalizeQuantiles(array);

and provides a function to calculate sample ranks

(r,m) = sampleRanks(array);

of a given vector or matrix.


Dependencies

Julia versions

  • Julia 0.7 or above

Third party packages

  • none

Standard Library packages

  • Distributed (optional, for multicore usage)
  • SharedArrays (optional, for multicore usage)


Usage examples: normalizeQuantiles

General usage

Pkg.add("NormalizeQuantiles");
using NormalizeQuantiles;

The following array is interpreted as a matrix with 4 rows and 3 columns:

array = [ 3.0 2.0 1.0 ; 4.0 5.0 6.0 ; 9.0 7.0 8.0 ; 5.0 2.0 8.0 ];
qn = normalizeQuantiles(array)
	julia> qn
	4×3 Array{Float64,2}:
	 2.0  3.0  2.0
	 4.0  6.0  4.0
	 8.0  8.0  7.0
	 6.0  3.0  7.0

The columns in qn are now quantile normalized to each other.

The input array must not have more than two dimensions.

The return type of function normalizeQuantiles is always Array{Float64,2}.
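As a quick sanity check on the result above: quantile normalization forces every column onto the same distribution of values, so at minimum the column means must agree. A small sketch using the qn values printed above:

```julia
# qn as printed in the example above
qn = [ 2.0 3.0 2.0 ; 4.0 6.0 4.0 ; 8.0 8.0 7.0 ; 6.0 3.0 7.0 ]
# after normalization every column has the same mean
colmeans = [ sum(qn[:, j]) / size(qn, 1) for j in 1:size(qn, 2) ]
```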

Missing Values

If your data contain missing values, such as NaN (Not a Number) or anything else that is not a number (e.g. a string), they are treated as missing and appear as NaN in the result:

array = [ NaN 2.0 1.0 ; 4.0 "empty" 6.0 ; 9.0 7.0 8.0 ; 5.0 2.0 8.0 ];
	julia> array
	4×3 Array{Any,2}:
	 NaN    2.0       1.0
	   4.0   "empty"  6.0
	   9.0  7.0       8.0
	   5.0  2.0       8.0
qn = normalizeQuantiles(array)
	julia> qn
	4×3 Array{Float64,2}:
	 NaN      3.25  1.5
	   5.0  NaN     5.0
	   8.0    8.0   6.5
	   5.0    3.25  6.5

NaN is of type Float64, so there is nothing similar for Int types.

	julia> typeof(NaN)
	Float64

You can convert the result to Array{Union{Missing, Float64},2} with:

qnMissing = convert(Array{Union{Missing,Float64}},qn)
	julia> qnMissing
	4×3 Array{Union{Missing, Float64},2}:
	 NaN      3.25  1.5
	   5.0  NaN     5.0
	   8.0    8.0   6.5
	   5.0    3.25  6.5
qnMissing[isnan.(qnMissing)] .= missing;
	julia> qnMissing
	4×3 Array{Union{Missing, Float64},2}:
	  missing  3.25      1.5
	 5.0        missing  5.0
	 8.0       8.0       6.5
	 5.0       3.25      6.5
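Going the other way, i.e. replacing missing with NaN to get back a plain Array{Float64,2}, can be sketched with a comprehension (qnMissing here is the value from the example above):

```julia
qnMissing = [ missing 3.25 1.5 ; 5.0 missing 5.0 ; 8.0 8.0 6.5 ; 5.0 3.25 6.5 ]
# replace every missing with NaN; the typed comprehension keeps the matrix shape
qnFloat = Float64[ ismissing(x) ? NaN : x for x in qnMissing ]
```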

SharedArray and multicore usage examples

Remark: restart julia now. addprocs() must be called before using NormalizeQuantiles.

To use multiple cores on a single machine you can use the standard packages Distributed and SharedArrays:

using Distributed
addprocs();
@everywhere using SharedArrays
@everywhere using NormalizeQuantiles

sa = SharedArray{Float64}([ 3.0 2.0 1.0 ; 4.0 5.0 6.0 ; 9.0 7.0 8.0 ; 5.0 2.0 8.0 ])
	julia> sa
	4×3 SharedArray{Float64,2}:
	 3.0  2.0  1.0
	 4.0  5.0  6.0
	 9.0  7.0  8.0
	 5.0  2.0  8.0
qn = normalizeQuantiles(sa)
	julia> qn
	4×3 Array{Float64,2}:
	 2.0  3.0  2.0
	 4.0  6.0  4.0
	 8.0  8.0  7.0
	 6.0  3.0  7.0

Remark: restart julia again.

For small data sets using Distributed and SharedArrays decreases performance:

using NormalizeQuantiles
la = randn((100,100));
normalizeQuantiles(la); @time normalizeQuantiles(la);
	julia> @time normalizeQuantiles(la);
	  0.003178 seconds (8.35 k allocations: 2.813 MiB)

Remark: restart julia.

using Distributed
addprocs();
@everywhere using SharedArrays
@everywhere using NormalizeQuantiles
sa = SharedArray{Float64}( randn((100,100)) );
normalizeQuantiles(sa); @time normalizeQuantiles(sa);
	julia> @time normalizeQuantiles(sa);
	  0.013014 seconds (12.10 k allocations: 432.146 KiB)

Remark: restart julia.

For larger data sets performance increases with multicore processors:

using NormalizeQuantiles
la = randn((1000,10000));
normalizeQuantiles(la); @time normalizeQuantiles(la);
	julia> @time normalizeQuantiles(la);
	  2.959431 seconds (784.18 k allocations: 2.281 GiB, 12.13% gc time)

Remark: restart julia.

using Distributed
addprocs();
@everywhere using SharedArrays
@everywhere using NormalizeQuantiles
la = randn((1000,10000));
sa = SharedArray{Float64}(la);
normalizeQuantiles(sa); @time normalizeQuantiles(sa);
	julia> @time normalizeQuantiles(sa);
	  1.030016 seconds (83.85 k allocations: 80.754 MiB, 5.77% gc time)

Using non-SharedArrays in a multicore setup is slowest:

	julia> normalizeQuantiles(la); @time normalizeQuantiles(la);
	  5.776685 seconds (294.06 k allocations: 92.532 MiB, 0.28% gc time)

OffsetArrays

Remark: with Julia 1.3.1 OffsetArrays are not supported until JuliaLang/julia#34886 is released (expected in Julia 1.5)

using NormalizeQuantiles, OffsetArrays

array = [ 3 missing 1 ; 4 5 6 ; missing 7 8 ; 5 2 8 ];
oa = OffsetArray(array,-1,-1);
	julia> oa
	4×3 OffsetArray(::Array{Union{Missing, Int64},2}, 0:3, 0:2) with eltype Union{Missing, Int64} with indices 0:3×0:2:
	 3          missing  1
	 4         5         6
	  missing  7         8
	 5         2         8
qn = normalizeQuantiles(oa);

The quantile normalized result is not an OffsetArray:

	julia> qn
	4×3 Array{Float64,2}:
	   2.0      NaN        2.0
	   4.0        6.5      4.0
	 NaN          6.66667  6.58333
	   6.66667    4.0      6.58333

Behaviour of function normalizeQuantiles

After quantile normalization the sets of values of each column have the same statistical properties. This is quantile normalization without a reference column.

The function normalizeQuantiles expects an array with dimension <= 2 and always returns a matrix of same dimensions as the input matrix and of type Array{Float64,2}.

NaN values of type Float64 and any other non-numbers, such as strings, are treated as random missing values, and the corresponding result values will be NaN. See "Remarks on data with missing values" below.

Equal values in a column of the input matrix will get different quantile normalized values. These different result values cannot be assigned back to their proper original positions, because the equal input values are indistinguishable. The mean of the different result values is therefore put back at each of the original positions.

Example:

julia> array = [ 1 2 ; 2  3 ; 2 5 ]
3×2 Matrix{Int64}:
 1  2
 2  3
 2  5

julia> qn = normalizeQuantiles(array)
3×2 Matrix{Float64}:
 1.5  1.5
 3.0  2.5
 3.0  3.5

In rows 2 and 3, instead of 2.5 and 3.5, the mean 3.0 is the result in both rows.
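This tie handling can be checked by hand: the two tied input values would have received 2.5 and 3.5, and both positions get their mean instead:

```julia
# mean of the two values the tied entries would otherwise receive
tie_mean = (2.5 + 3.5) / 2
```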

Data prerequisites

To use quantile normalization your data should have the following properties:

  • the distribution of values in each column should be similar
  • number of values for each column should be large
  • number of missing values in the data should be small and of random nature

Remarks on data with missing values

Currently there seems to be no general agreement on how to deal with missing values during quantile normalization. Here we put any given missing value back into the sorted column at the original position before calculating the mean of the rows.
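For intuition, the basic procedure without ties or missing values can be sketched in plain Julia. This is the textbook algorithm (sort each column, average across rows, map the means back), not the package's actual implementation:

```julia
# Textbook quantile normalization sketch; valid only for input
# without ties and without missing values.
function qn_sketch(m::Matrix{Float64})
    nrows, ncols = size(m)
    # remember for each column where every sorted value came from
    perms = [ sortperm(m[:, j]) for j in 1:ncols ]
    # mean of each row of the column-sorted matrix
    rowmeans = [ sum(m[perms[j][i], j] for j in 1:ncols) / ncols for i in 1:nrows ]
    # write each row mean back to the original position of the value it replaces
    result = similar(m)
    for j in 1:ncols, i in 1:nrows
        result[perms[j][i], j] = rowmeans[i]
    end
    return result
end
```

On tie-free input this agrees with normalizeQuantiles; the package additionally averages the results of tied values and carries missing values through as described above.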

List of all exported definitions for normalizeQuantiles

normalizeQuantiles
Definition: Array{Float64,2} function normalizeQuantiles(matrix::AbstractArray)
Input type: matrix::AbstractArray
Return type: Array{Float64,2}

Usage examples: sampleRanks

For a given vector, sampleRanks calculates the rank of each element, i.e. the position of the element in the sorted vector.

using NormalizeQuantiles
a = [ 5.0 2.0 4.0 3.0 1.0 ];
(r,m) = sampleRanks(a);   # here only return value r is relevant, for m see below
r
	julia> r
	5-element Array{Union{Missing, Int64},1}:
	 5
	 2
	 4
	 3
	 1

If you provide a matrix like

array = [ 1.0 2.0 3.0 ; 4.0 5.0 6.0 ; 7.0 8.0 9.0 ; 10.0 11.0 12.0 ]
	julia> array
	4×3 Array{Float64,2}:
	  1.0   2.0   3.0
	  4.0   5.0   6.0
	  7.0   8.0   9.0
	 10.0  11.0  12.0

ranks are calculated column-wise; in other words, array is treated as array[:]:

(r,m) = sampleRanks(array);
r
	julia> r
	12-element Array{Union{Missing, Int64},1}:
	  1
	  4
	  7
	 10
	  2
	  5
	  8
	 11
	  3
	  6
	  9
	 12
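The column-major order is easiest to see with a plain-Julia sketch; simpleranks below is a hypothetical helper that is only valid without ties or missing values, not the package's sampleRanks:

```julia
# rank of each element = its position in the sorted vector
simpleranks(v::AbstractVector) = invperm(sortperm(v))

array = [ 1.0 2.0 3.0 ; 4.0 5.0 6.0 ; 7.0 8.0 9.0 ; 10.0 11.0 12.0 ]
r = simpleranks(vec(array))      # ranks of array[:], column-major order
rmat = reshape(r, size(array))   # ranks back in the shape of array
```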

There are three optional keyword parameters tiesMethod, naIncreasesRank and resultMatrix:

(r,m) = sampleRanks(a,tiesMethod=tmMin,naIncreasesRank=false,resultMatrix=true);
(r,m) = sampleRanks(a,resultMatrix=true);

Equal values in the vector are called ties. There are several methods available on how to treat ties:

  • tmMin : the smallest rank for all ties (default)
  • tmMax : the largest rank
  • tmOrder : increasing ranks
  • tmReverse : decreasing ranks
  • tmRandom : the ranks are randomly distributed
  • tmAverage : the average rounded to the next integer

These methods are defined and exported as

	@enum qnTiesMethods tmMin tmMax tmOrder tmReverse tmRandom tmAverage

Internally, ties first get increasing ranks; the chosen method is then applied to these.

The rank of the next value after the ties is the highest rank used so far plus 1.

Examples:

a = [ 7.0 2.0 4.0 2.0 1.0 ];
(r,m) = sampleRanks(a); #which is the same as (r,m)=sampleRanks(a,tiesMethod=tmMin)
r
	julia> r
	5-element Array{Union{Missing, Int64},1}:
	 4
	 2
	 3
	 2
	 1
(r,m) = sampleRanks(a,tiesMethod=tmMax);
r
	julia> r
	5-element Array{Union{Missing, Int64},1}:
	 5
	 3
	 4
	 3
	 1
(r,m) = sampleRanks(a,tiesMethod=tmReverse);
r
	julia> r
	5-element Array{Union{Missing, Int64},1}:
	 5
	 3
	 4
	 2
	 1

One or more missing values in the vector are never considered equal and remain in their positions after sorting. The rank of each missing value is always missing::Missing. By default, a missing value does not increase the rank for successive values. Setting the keyword parameter naIncreasesRank to true changes that behavior to increasing the rank by 1 for successive values:

a = [ 7.0 2.0 4.0 2.0 1.0 ];
a[1] = NaN;
(r,m) = sampleRanks(a);
r
	julia> r
	5-element Array{Union{Missing, Int64},1}:
	  missing
	 2
	 3
	 2
	 1
(r,m) = sampleRanks(a,naIncreasesRank=true);
r
	julia> r
	5-element Array{Union{Missing, Int64},1}:
	  missing
	 3
	 4
	 3
	 2

The keyword parameter resultMatrix lets you generate a dictionary of rank indices to allow direct access to all values with a given rank. For large vectors this may consume a lot of memory; therefore the default is to return an empty dictionary of type Dict{Int64,Array{Int64,N}}:

a = [ 7.0 2.0 4.0 2.0 1.0 ];
(r,m) = sampleRanks(a,resultMatrix=true);
m
	julia> m
	Dict{Int64,Array{Int64,N} where N} with 4 entries:
	  4 => [1]
	  2 => [2,4]
	  3 => [3]
	  1 => [5]
haskey(m,2)   #does rank 2 exist?
	julia> haskey(m,2)
	true
a[m[2]]   #all values of rank 2
	julia> a[m[2]]
	2-element Array{Float64,1}:
	 2.0
	 2.0
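The dictionary m is simply the indices of r grouped by rank; building it by hand (assuming no missing values) can be sketched as:

```julia
r = [4, 2, 3, 2, 1]                 # ranks from the tmMin example above
m = Dict{Int,Vector{Int}}()
for (i, rank) in enumerate(r)
    push!(get!(m, rank, Int[]), i)  # append index i to the list for this rank
end
```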

List of all exported definitions for sampleRanks

sampleRanks
Definition: @enum qnTiesMethods tmMin tmMax tmOrder tmReverse tmRandom tmAverage
Description:
tmMin the smallest rank for all ties
tmMax the largest rank
tmOrder increasing ranks
tmReverse decreasing ranks
tmRandom the ranks are randomly distributed
tmAverage the average rounded to the next integer
sampleRanks
Definition: (Array{Union{Missing,Int},1},Dict{Int,Array{Int}}) sampleRanks(array::AbstractArray; tiesMethod::qnTiesMethods=tmMin, naIncreasesRank=false, resultMatrix=false)
Input type: array::AbstractArray, the data
Keyword argument: tiesMethod::qnTiesMethods, how to treat ties (default: tmMin)
Keyword argument: naIncreasesRank::Bool, increase rank by one on missing values (default: false)
Keyword argument: resultMatrix::Bool, create the rank dictionary (default: false)
Return type: (Array{Union{Missing,Int},1},Dict{Int,Array{Int}})

normalizequantiles.jl's People

Contributors

juliatagbot, oheil, staticfloat, tkelman

normalizequantiles.jl's Issues

new function to normalize DataFrame

Hi @oheil,

Thanks for writing this awesome package. I'm using it together with DataFrames, and would like to contribute a helper function that takes a DataFrame and an array of columns to normalize, and returns a copy of the original DataFrame with the target columns quantile normalized.

myDF = DataFrame(Genes = ["A", "B", "C", "D", "E"], 
                    cond1 = rand(5),
                    cond2 = rand(5),
                    cond3 = rand(5))

should return something like

Genes cond1 cond2 cond3
1 A 0.32563181788519513 0.21438356281559678 0.8771217378138845
2 B 0.6870295079889066 0.6915367387396687 0.7535393139297728
3 C 0.2862851833941058 0.16805347512816704 0.8187249261058329
4 D 0.4158104577567945 0.24391342294094476 0.829337861985237
5 E 0.3787968401059918 0.4015742563211182 0.30126103627947454
# quantile normalize columns with expression data
function MynormalizeQuantiles(data::DataFrames.DataFrame, dataColumns::Array{Symbol,1})
    # get columns to NOT normalize
    otherCols = []
    for col in names(data)
        if !in(col, dataColumns)
            push!(otherCols, col)
        end
    end

    # save this subset of the dataframe for later
    forLater = data[Array{Symbol,1}(otherCols)]

    # convert columns of interest to DataArray
    data_DA = DataArray(data[:, dataColumns])

    # melt into 1x(nrow*ncol) array with `for i in data_DA`
    # if NA ? return null : else return converted Float
    data_NA = [isna(i) ? Nullable{Float64}() : Nullable{Float64}(i) for i in data_DA]

    # convert back to original nrow x ncol shape and quantileNormalize
    quantNorm = normalizeQuantiles(reshape(data_NA, size(data_DA)))

    # convert back to DataArray
    quantNorm_DA = reshape(DataArray([isnull(i) ? NA : get(i) for i in quantNorm]), size(quantNorm))

    # build new dataframe by appending columns in order
    newData = DataFrame()
    for i in 1:length(dataColumns)
        newData = hcat(newData, DataFrame(Dict(dataColumns[i] => quantNorm_DA[:, i])))
    end

    # merge and order back to original
    newData = hcat(newData, forLater)[names(data)]

    return newData
end

# normalize myDF on all but :Genes column
MynormalizeQuantiles(myDF, deleteat!(names(myDF), 1))

should return something like

Genes cond1 cond2 cond3
1 A 0.43118489821018824 0.43118489821018824 0.7518959948474867
2 B 0.7518959948474867 0.7518959948474867 0.43118489821018824
3 C 0.2518665649339158 0.2518665649339158 0.4804783963842565
4 D 0.5489075253543833 0.4804783963842565 0.5489075253543833
5 E 0.4804783963842565 0.5489075253543833 0.2518665649339158

I tried to add this and submit a pull request myself using the instructions here, under Executive summary, but after adding this function (and renaming it from my temporary name MynormalizeQuantiles to normalizeQuantiles) the tests failed. I've never worked with Julia packages and had no idea where to start troubleshooting, but I would be happy to retry a pull request if you could give me some pointers; or, if it's easier, feel free to add it to the code base yourself.

I would also be interested in working on auto detection and usage of multiple cores, using your examples here. For example, detect the number of cores available with nprocs(), and if more than 2 are available, automatically use the SharedArray methodology.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!
