fslaborg / fsharp.stats
Statistical testing, linear algebra, machine learning, fitting and signal processing in F#
Home Page: https://fslab.org/FSharp.Stats/
License: Other
In Fitting.NonlinearRegression, a model for an exponential function of the form y(x) = a * exp(b * x) is missing.
The linearization of an exponential function is often proposed, but underestimates the residuals at high y-values.
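For reference, a minimal sketch of what such a model entry would consist of: the model function and its partial derivatives, as needed by Gauss-Newton/Levenberg-Marquardt style optimizers (the names `expFunc` and `expGradient` are hypothetical, not the library's API):

```fsharp
// y(x) = a * exp(b * x); parameter vector p = [|a; b|]
let expFunc (p: float[]) x = p.[0] * exp (p.[1] * x)

// partial derivatives of the model with respect to the parameters:
// dy/da = exp(b * x);  dy/db = a * x * exp(b * x)
let expGradient (p: float[]) x =
    [| exp (p.[1] * x)
       p.[0] * x * exp (p.[1] * x) |]
```

Fitting in this nonlinear form avoids the residual distortion that the log-linearization introduces at high y-values.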
The documentation specifies that the package is available on NuGet, but currently it doesn't seem to be.
Accessing the provided NuGet link, one gets a 404 (not found).
Alternatively:
FSharp.Stats
Thanks!
Some functions in the Algebra module are callable from outside even though they should be encapsulated.
Example:
FSharp.Stats.Algebra.EVD.SymmetricEvd
Include p-value calculation for correlation coefficients. The easiest way is to use a t-test:
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Testing_using_Student's_t-distribution
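A sketch of the test statistic from the linked article (the helper name is hypothetical; the two-sided p-value would then come from a Student t distribution with n - 2 degrees of freedom, e.g. via the library's t distribution):

```fsharp
// t statistic for H0: rho = 0, given a sample correlation r and sample size n.
// Under H0, t follows a Student t distribution with n - 2 degrees of freedom.
let correlationTStatistic (r: float) (n: int) =
    let df = float n - 2.
    r * sqrt (df / (1. - r * r))
```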
Is your feature request related to a problem? Please describe.
Not really a problem. Currently the only implemented ordination method in FSharp.Stats is PCA. PCA is applied on data where a linear relationship is expected and the distance measure of choice is euclidean distance.
Describe the solution you'd like
Addition/implementation of the following Multidimensional Scaling algorithms:
Principal Coordinates Analysis (PCoA, Classical Multidimensional Scaling):
Non-metric multidimensional scaling (nMDS):
Describe the bug
When using the FSharp.Stats.Signal.PeakDetection.SecondDerivative.getPeaks function with polynomial order 1 in combination with the second derivative, it results in a "System.IndexOutOfRangeException: Index was outside the bounds of the array." error.
To Reproduce
Steps to reproduce the behavior:
I looked into the code, tried to narrow down the origin of the error, and found it to be caused by the
Signal.Filtering.savitzky_golay function used for calculating the second derivative and for data smoothing.
open FSharp.Stats
/// height = height of the gaussian peak, mean = mean of the gaussian distribution,
/// sigma = standard deviation of the gaussian distribution, x = position on the x-axis
let gaussianFunctionForY height mean sigma x = height * exp (-((x - mean) ** 2.) / (2. * (sigma ** 2.)))
let XX = [|0. .. 0.1 .. 100.|]
let gaussianYY = XX |> Array.map (fun x -> gaussianFunctionForY 50. 30. 0.3 x)
getPeaks 0. 1 5 XX gaussianYY // error seen above
let smoothedData =
    Signal.Filtering.savitzky_golay 5 1 0 1 gaussianYY
    |> Array.ofSeq // no error, works as intended
Signal.Filtering.savitzky_golay 5 1 2 1 smoothedData // error seen above
Signal.Filtering.savitzky_golay 5 1 2 1 gaussianYY // error seen above
Signal.Filtering.savitzky_golay 5 1 1 1 gaussianYY // no error
Signal.Filtering.savitzky_golay 5 2 2 1 gaussianYY // no error
Apparently, the savitzky_golay function fails at the following line, but I am not 100% sure that I read the debugger correctly.
let m = (Algebra.LinearAlgebraManaged.pseudoInvers b).Row(deriv) * ((float(rate)**float(deriv)) * SpecialFunctions.Factorial.factorial(deriv))
Expected behavior
It should smooth the data while fitting first-order polynomials for the second derivative.
Additional context
I am using the FSharp.Stats developer branch
Smoothing splines are used for nonparametric fitting. If many y values are associated with one x value, an automatic weighting can be applied.
When using Spline.smoothingSpline, identical x values are filtered out, although the method seems to be able to cope with duplicates.
Reference: https://robjhyndman.com/etc5410/splines.pdf
The knots (basispts) have to be distinct, so a filtering/warning step is necessary.
To reduce the impact of single data points on the regression, it is possible to assign weights to the data (e.g. based on replicate variance or signal intensity).
Caution: negative weights are not allowed.
It should be added to: Fitting.LinearRegression.OrdinaryLeastSquares.Polynomial
Curve and Surface Fitting, Academic Press, 1987 pp. 55
When using the hypergeometric distribution, large input numbers should be processed using exp (Binomial.coeffcientLn K k) rather than Binomial.coeffcient K k.
(hypergeometric 19714 29 1643).PDF 1
reports nan while
(hypergeometric 19714 29 1643).CDF 1. - (hypergeometric 19714 29 1643).CDF 0.
reports 0.21.
A change of
(SpecialFunctions.Binomial.coeffcient K k) * (SpecialFunctions.Binomial.coeffcient (N-K) (n-k)) / (SpecialFunctions.Binomial.coeffcient N n)
to
exp ((SpecialFunctions.Binomial.coeffcientLn K k) + (SpecialFunctions.Binomial.coeffcientLn (N-K) (n-k)) - SpecialFunctions.Binomial.coeffcientLn N n)
in Distributions.Discrete line 191 should fix this issue.
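A self-contained sketch of the log-space computation (helper names are hypothetical; a naive summed-logs binomial coefficient stands in here for SpecialFunctions.Binomial.coeffcientLn):

```fsharp
// ln C(n, k) via summed logs, assuming 0 <= k <= n; illustration only
let lnBinomial (n: int) (k: int) =
    List.sum [ for i in 1 .. k -> log (float (n - k + i)) - log (float i) ]

// hypergeometric PMF computed entirely in log space: stays finite for
// inputs where the direct product/quotient of coefficients overflows
let hypergeomPmf N K n k =
    exp (lnBinomial K k + lnBinomial (N - K) (n - k) - lnBinomial N n)
```

With the parameters from the report, this returns a finite probability instead of nan.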
Describe the bug
GoodnessOfFit.calculateSST
does not report the total sum of squares but the regression sum of squares.
The following procedure reports the correct value:
let sos = GoodnessOfFit.calculateSumOfSquares fitFunc x_Data y_Data
sos.Total
Describe the bug
I tried using FSharp.Stats.Distributions.Empirical.create 1. dataPoints
to bin my data points with a bandwidth of 1. Afterwards I wanted to sample from the distribution via Distributions.Empirical.random,
but it returned the following error every other time:
> System.Collections.Generic.KeyNotFoundException: An index satisfying the predicate was not found in the collection.
at Microsoft.FSharp.Collections.SeqModule.Find[T](FSharpFunc`2 predicate, IEnumerable`1 source)
at <StartupCode$FSI_0249>.$FSI_0249.main@() in D:\Freym\Source\Repos\Freymaurer\Jupyter_PraktikumBiotech\Scripts\JP06_Retention_time_and_scan_time.fsx:line 86
Stopped due to error
To Reproduce
Steps to reproduce the behavior:
let file = "Chlamy_JGI5_5(Cp_Mp).fasta" // can be found in BioFSharp.Mz
let sequences =
    file
    |> FastA.fromFile BioArray.ofAminoAcidString
    |> Seq.toArray
let digestedProteins =
    sequences
    |> Array.mapi (fun i fastAItem ->
        Digestion.BioArray.digest Digestion.Table.Trypsin i fastAItem.Sequence
        |> Digestion.BioArray.concernMissCleavages 0 0
    )
    |> Array.concat
let digestedPeptideMasses =
    digestedProteins
    |> Array.choose (fun peptide ->
        let mass = BioSeq.toMonoisotopicMassWith (BioItem.monoisoMass ModificationInfo.Table.H2O) peptide.PepSequence
        if mass < 3000. then Some mass else None
    )
let empHis = FSharp.Stats.Distributions.Empirical.create 1. digestedPeptideMasses
Expected behavior
It should produce a distribution from which one can randomly sample according to the probabilities.
Additional context
I already looked into the FSharp.Stats.Distributions.Empirical module and think that the Distributions.Empirical.random function is fine, but that the FSharp.Stats.Distributions.Empirical.create function is not correct when calculating the area subfunction. It seems unintuitive.
For my example above, the sum of all probabilities was 0.6921621451, so every time the randomly generated target subfunction of Distributions.Empirical.random produced a value above ~0.69, an error was returned.
Is your feature request related to a problem? Please describe.
The developer branch (5c1d95a) now contains a Levenberg-Marquardt variant that supports box constraints, as well as a classic implementation.
Describe the solution you'd like
Refactor the code to yield a single implementation that can be run constrained and unconstrained.
Distribution distance metrics are a good way to determine which distributions are more similar to each other based just on the absolute values.
Is your feature request related to a problem? Please describe.
During exercises it became clear that the functions in this module lack certain functionality as well as proper documentation.
Describe the solution you'd like
I will try to extend the current module with additional functions and documentation, while keeping the current functions as they are right now.
Describe alternatives you've considered
Alternatively, I could also just write some brief documentation.
When using Distributions.Continuous.normal, the normal distribution is initialized by mu (mean) and tau (standard deviation). It is more common to use sigma as the descriptor for the standard deviation.
Since tau is part of an alternative parameterization describing the reciprocal of the variance, it may be misleading to some users.
Additionally, the normal distribution is often initialized by N(mean, variance) rather than by mean and standard deviation, which increases the need for a precise description.
https://en.wikipedia.org/wiki/Normal_distribution#Alternative_parameterizations
Just interested - what is the aim of the project?
There is already a very good math library, Math.NET (although it is written in C#, there is an F# API).
Is your feature request related to a problem? Please describe.
Set comparison is not covered in the codebase at the moment.
Describe the solution you'd like
Addition of:
Description
add Matrix.countBy function
Interpolating polynomials are polynomials of degree n-1. This degree is sufficiently flexible to interpolate all present data points.
The construction of the linear system of equations is straightforward and does not require the expensive construction of least-squares linear systems. Additionally, the least-squares approach present in Fitting.LinearRegression.Polynomial
does not necessarily converge to an interpolating polynomial.
Try to interpolate the following data with polynomial regression:
let x_DataH = vector [0.;10.;30.;50.;70.;80.;82.]
let y_DataH = vector [150.;200.;200.;200.;180.;100.;0.]
An interpolating curve is expected, but the first knot at x = 0 is missed by the curve.
Proposed solution: construct a linear system of equations that has a unique solution and interpolates all given knots.
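To illustrate that interpolation needs no least-squares machinery, here is a sketch using the Lagrange form of the unique interpolating polynomial (the function name is hypothetical); by construction it passes through every knot, including x = 0:

```fsharp
// Lagrange form of the unique degree-(n-1) interpolating polynomial;
// mathematically equivalent to solving the Vandermonde system, but with
// no least-squares construction involved
let lagrangeInterpolate (xs: float[]) (ys: float[]) x =
    xs
    |> Array.mapi (fun i xi ->
        let basis =
            xs
            |> Array.mapi (fun j xj -> if i = j then 1. else (x - xj) / (xi - xj))
            |> Array.fold (*) 1.
        ys.[i] * basis)
    |> Array.sum
```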
The Array.shuffle function assigns the shuffled values to the original input array.
The input array must not be modified; it has to be copied before shuffling.
Module: FSharp.Stats.ML.Unsupervised.IterativeClustering
When working with kMeans clustering, the result type KClusteringResult contains a classifier to assign data points to the nearest centroid. The description states that the classifier returns the cluster index and the data point, but it actually returns the cluster index and the coordinates of the cluster centroid.
Since the data point is required for calling the classifier function anyway, it may be reasonable to return the centroid coordinates. If so, the description of the record type has to be changed to "Classifier function returns cluster index and centroid".
Is your feature request related to a problem? Please describe.
Many functions lack collection type consistency, e.g. an array as parameter but a vector as return type (for instance FSharp.Stats.Filtering.savitzky_golay).
Describe the solution you'd like
Refactoring of functions to avoid unnecessary collection conversions.
Pearson and similar correlation coefficients can be outlier sensitive. Biweight midcorrelation was shown to be more robust against outliers in some cases.
It might therefore be a good addition to the correlation module.
Is your feature request related to a problem? Please describe.
Post hoc tests are used to inspect a multiple-group comparison (e.g. ANOVA) for individual significant group mean differences.
While Tukey's HSD aims to perform all k(k-1)/2 comparisons, it is often desired to test all groups against a single control or standard.
Describe the solution you'd like
Dunnett's test allows this many-to-one comparison with higher power than Tukey's HSD.
Additional information/References
https://www.researchgate.net/post/What_is_the_difference_between_Tukeys_Post_Hoc_Test_and_Students_t-test
Testing.Outliers.tukey creates a dedicated Outliers type consisting of a lower and an upper border.
For convenience it would be beneficial to make use of the Interval type. Additionally, the outlier detection may fit better in Signal than in Testing.
SAM.S0.estimate returns values as percentages (values between 0. and 100.). It should instead return values as probabilities (values between 0. and 1.), as this is the standard behaviour in the repository.
Interpolating cubic splines are able to fit smooth curves through data points. For every interval in the data, a cubic polynomial with special boundary conditions is constructed. Thereby Runge's phenomenon is avoided to some extent.
It should be added to: Fitting.Spline.Interpolation
Cubic Spline Interpolation
Boundary conditions
Cubic spline online tool
Constrained Spline
Constrained monotone Spline
Is your feature request related to a problem? Please describe.
The LinearAlgebra module is a big mess: many functions are not implemented, the naming has to be unified, and unmanaged access to LAPACK functions is sparse.
Describe the solution you'd like
The subproject FSharp.Stats.LAPACK (already existing as a placeholder project) should contain the LAPACK stubs and function access. The naming in LinearAlgebra.fs has to follow a convention. Drop MKL support and embrace Netlib LAPACK, as it is open source. Unify the output types of the managed and unmanaged versions of the linear algebra functions.
Describe alternatives you've considered
Support MKL alongside Netlib LAPACK
To Reproduce
[|2.0; 3.0; 4.0; 5.0; 5.0; 5.0|]
|> Rank.rankAverage
returns
val it : float array = [|1.0; 2.0; 3.0; 2.666666667; 2.666666667; 2.666666667|]
Expected behavior
The result should be
val it : float array = [|1.0; 2.0; 3.0; 5.0; 5.0; 5.0|]
as the three fives should receive the ranks 4, 5 and 6, whose average is 5.
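A sketch of average ranking with the expected tie handling (a hypothetical standalone function, not the library code):

```fsharp
// average ranking: tied values share the mean of the 1-based ranks they
// occupy (here the three 5s occupy ranks 4, 5, 6 -> rank 5 each)
let rankAverage (xs: float[]) =
    let ranks = Array.zeroCreate xs.Length
    let sorted = xs |> Array.mapi (fun i x -> i, x) |> Array.sortBy snd
    let mutable i = 0
    while i < sorted.Length do
        // find the run of tied values starting at position i
        let mutable j = i
        while j < sorted.Length && snd sorted.[j] = snd sorted.[i] do
            j <- j + 1
        let meanRank = float (i + j + 1) / 2.   // mean of ranks i+1 .. j
        for k in i .. j - 1 do
            ranks.[fst sorted.[k]] <- meanRank
        i <- j
    ranks
```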
Some modules in FSharp.Stats.ML.Unsupervised are not part of the project. The outdated dependencies have to be updated for reintegration.
To model growth curves, additional models such as Gompertz should be added to the NonlinearRegression model table.
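For illustration, one common Gompertz parameterization could be sketched as follows (parameter names are assumptions; like any entry in the NonlinearRegression model table, it would additionally need its partial derivatives):

```fsharp
// Gompertz growth curve: y(t) = A * exp(-exp(-k * (t - tInfl)))
// A: upper asymptote, k: growth rate, tInfl: time of the inflection point
let gompertz a k tInfl t = a * exp (-(exp (-k * (t - tInfl))))
```

At t = tInfl the curve reaches A/e, and it approaches A for large t.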
Is your feature request related to a problem? Please describe.
It can become very complicated to derive e.g. an erfcx(x) function, so it would be nice to have an approximation.
Describe the solution you'd like
I want to implement a numerical differentiation function with a method to calculate an "optimal" window h over which the differentiation is done.
Describe alternatives you've considered
MATLAB can calculate the derivative of an erfcx(x) function, but only for given numbers, not for variables.
Additional context
Everything is set and ready, the PR is coming. 👍
If heteroscedasticity in the data is assumed, it would be beneficial if a weighted Pearson correlation could be calculated, in which points of high importance receive a higher impact.
Note: test whether switching the x- and y-data has an impact on the correlation value!
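A possible sketch of such a weighted Pearson coefficient (hypothetical helper; with equal weights it reduces to the ordinary coefficient, and the formula is symmetric in x and y):

```fsharp
// weighted Pearson correlation: weighted means, covariance and variances
let weightedPearson (w: float[]) (x: float[]) (y: float[]) =
    let wSum = Array.sum w
    let mx = (Array.map2 (*) w x |> Array.sum) / wSum
    let my = (Array.map2 (*) w y |> Array.sum) / wSum
    let cov  = Array.map3 (fun wi xi yi -> wi * (xi - mx) * (yi - my)) w x y |> Array.sum
    let varX = Array.map2 (fun wi xi -> wi * (xi - mx) ** 2.) w x |> Array.sum
    let varY = Array.map2 (fun wi yi -> wi * (yi - my) ** 2.) w y |> Array.sum
    cov / sqrt (varX * varY)
```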
Describe the bug
Shuffling rapidly in a loop yields the same arrays each time. There seems to be an issue with the random number generation in Random.rndgen.
To Reproduce
code snippet:
for i = 0 to 10 do
    printfn "%A" (Array.shuffleFisherYates [|0 .. 10|])
Output :
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
Real: 00:00:00.009, CPU: 00:00:00.000, GC gen0: 1, gen1: 0, gen2: 0
val it : unit = ()
Expected behavior
Should be (pseudo-)random, like this snippet:
for i = 0 to 10 do
    printfn "%A" (Array.init 11 (fun i -> [|0 .. 10|].[Random.rndgen.NextInt(10)]))
Output:
[|2; 5; 9; 3; 5; 2; 9; 4; 5; 9; 0|]
[|4; 9; 0; 1; 5; 0; 1; 4; 2; 6; 0|]
[|7; 8; 0; 5; 3; 7; 9; 7; 5; 2; 0|]
[|4; 7; 1; 4; 4; 7; 2; 1; 1; 6; 0|]
[|4; 0; 0; 9; 5; 8; 4; 1; 7; 5; 2|]
[|1; 8; 1; 3; 9; 9; 2; 4; 7; 3; 8|]
[|4; 4; 2; 9; 0; 5; 3; 2; 1; 4; 3|]
[|6; 4; 9; 5; 3; 1; 0; 3; 3; 4; 4|]
[|2; 6; 3; 3; 1; 7; 3; 8; 6; 9; 6|]
[|5; 6; 6; 7; 6; 8; 8; 7; 4; 4; 7|]
[|7; 1; 7; 3; 4; 5; 1; 3; 4; 0; 4|]
Real: 00:00:00.010, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val it : unit = ()
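A possible fix can be sketched as a Fisher-Yates shuffle that copies its input and reuses a single shared generator (standalone sketch; in the library the natural choice would be the shared Random.rndgen). Creating a fresh clock-seeded generator per call is exactly what makes rapid successive calls return identical arrays:

```fsharp
// one generator shared across all calls, created once
let rng = System.Random()

// Fisher-Yates shuffle on a copy: the input array stays unmodified
let shuffleCopy (arr: 'a[]) =
    let a = Array.copy arr
    for i = a.Length - 1 downto 1 do
        let j = rng.Next(i + 1)      // j uniform in 0 .. i
        let tmp = a.[i]
        a.[i] <- a.[j]
        a.[j] <- tmp
    a
```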
Kruskal-Wallis one-way ANOVA is a non-parametric method for comparing independent samples.
Describe the bug
Given the following conditions:
Population size (N): 100
Successes in population (K): 80
Sample size (n): 50
Number of successes in sample (k): 20
The probability density function at k = 20 should return 0., because it is impossible to obtain 30 failures (n-k) if there are only 20 failures in the population (N-K).
Nevertheless, the input is valid and the function should return a probability.
To Reproduce
(Distributions.Discrete.hypergeometric 100 80 50).PDF 20
Expected behavior
Should return 0.
Possible fix
In Discrete.hypergeometric.PDF, check whether (N-K) < (n-k) and return 0.
Is your feature request related to a problem? Please describe.
Confidence and prediction bands offer an interesting insight into regression outcomes. While the fitted function (correct model selection is assumed) shows the best estimate of the current model, it does not show the confidence.
While confidence and prediction bands are implemented for simple linear regression, a generalized version for linear and non-linear regression would be beneficial.
In ordinary least squares regression, the fitted straight regression line is heavily influenced by outliers. The Theil-Sen estimator is a non-parametric method to cope with data corrupted by outliers.
It takes the median of all slopes (and intercepts) between every pair of points in the data set as an estimator for the regression line (f(x) = mx + b).
Sen extended Theil's method by using only pairs of points with distinct x coordinates.
It should be added to: Fitting.LinearRegression.RobustRegression
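A standalone sketch of the estimator (function names are hypothetical): median slope over all point pairs with distinct x coordinates, then the intercept as the median of the residuals y_i - m * x_i:

```fsharp
let median (xs: float[]) =
    let s = Array.sort xs
    let n = s.Length
    if n % 2 = 1 then s.[n / 2] else (s.[n / 2 - 1] + s.[n / 2]) / 2.

// Theil-Sen estimator for f(x) = m * x + b; robust to outliers because
// the median of the pairwise slopes ignores extreme slope values
let theilSen (x: float[]) (y: float[]) =
    let slopes =
        [| for i in 0 .. x.Length - 2 do
             for j in i + 1 .. x.Length - 1 do
               if x.[i] <> x.[j] then
                 yield (y.[j] - y.[i]) / (x.[j] - x.[i]) |]
    let m = median slopes
    let b = Array.map2 (fun xi yi -> yi - m * xi) x y |> median
    m, b
```

Note how a single corrupted point leaves the estimate unchanged, where OLS would tilt the whole line.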
FSharp.Stats.Distributions.Empirical.random makes wrong use of a random number generator, e.g.:
[
    for i = 1 to 100 do
        yield random exampleDistribution
]
This should yield a sample of numbers present in the example distribution.
Instead, the loop above yields the same number 100 times, because the random number generator is created each time the function is called, but always with the same seed (system time, I believe).
Possible fix: use the random number logic of FSharp.Stats.
Is your feature request related to a problem? Please describe.
The calculation of confidence intervals is missing.
For polynomial fitting, leave-one-out cross validation (LOOCV) is a simple model selection method. It should be added to: Fitting.GoodnessOfFit.OrdinaryLeastSquares.Polynomial
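The scheme can be sketched generically (names are hypothetical): given a fit function that trains on the remaining n-1 points and a predict function for the held-out point, LOOCV returns the mean squared prediction error; model selection then picks the polynomial degree minimizing it:

```fsharp
// generic leave-one-out cross validation: for each i, train on all points
// except (xs.[i], ys.[i]) and score the prediction at the held-out x
let loocv (fit: float[] -> float[] -> 'model)
          (predict: 'model -> float -> float)
          (xs: float[]) (ys: float[]) =
    [| for i in 0 .. xs.Length - 1 do
         let xTrain = xs |> Array.indexed |> Array.filter (fun (j, _) -> j <> i) |> Array.map snd
         let yTrain = ys |> Array.indexed |> Array.filter (fun (j, _) -> j <> i) |> Array.map snd
         let err = predict (fit xTrain yTrain) xs.[i] - ys.[i]
         yield err * err |]
    |> Array.average
```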
When using int collections as input sequences, the output can be wrong or NaN due to int overflow.
let x1 = Array.init 1000 id |> FSharp.Stats.Array.shuffleFisherYates
let x2 = Array.init 1000 id |> FSharp.Stats.Array.shuffleFisherYates
let y1 = x1 |> Array.map float
let y2 = x2 |> Array.map float
FSharp.Stats.Correlation.Seq.pearson y1 y2
FSharp.Stats.Correlation.Seq.pearson x1 x2 // gives a different output than above because of int overflow
let a1 = Array.init 10000 id |> FSharp.Stats.Array.shuffleFisherYates
let a2 = Array.init 10000 id |> FSharp.Stats.Array.shuffleFisherYates
let b1 = a1 |> Array.map float
let b2 = a2 |> Array.map float
FSharp.Stats.Correlation.Seq.pearson b1 b2
FSharp.Stats.Correlation.Seq.pearson a1 a2 // gives a NaN
Convert the input sequences from int to float; the internal float conversion should be done at an earlier point.
For piping convenience, the matrix/vector input should be the last parameter in functions from the Matrix and Vector modules.
Example:
Matrix.getCols (m: Matrix<float>) (i: int)
Vector.get (v: Vector) (i: int)
should be
Matrix.getCols (i: int) (m: Matrix<float>)
Vector.get (i: int) (v: Vector)
add unit tests for Vector module
add unit tests for Matrix module
Before clean up and pushing implement unit tests
Compare parameter order with Core functions
If functions are renamed or the parameter order is switched, keep the old naming with an [<Obsolete("Do not use. Use [function name] instead.")>] tag.
Parameter renaming can be performed without further attention.
Commonly, the F test is used to check samples for homoscedasticity.
While the distribution and the test statistic of the test itself are implemented, an FTest module that tests two samples for homoscedasticity is missing. Additional documentation should cover theory and applications of the one- and two-tailed F test.
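Such a module would essentially compute the variance ratio; a standalone sketch (hypothetical names, with the larger variance conventionally placed in the numerator):

```fsharp
// unbiased sample variance (n - 1 denominator)
let sampleVariance (xs: float[]) =
    let m = Array.average xs
    (xs |> Array.sumBy (fun x -> (x - m) ** 2.)) / float (xs.Length - 1)

// F statistic for testing equal variances; the p-value would then come
// from an F distribution with the returned degrees of freedom
let fStatistic (a: float[]) (b: float[]) =
    let va, vb = sampleVariance a, sampleVariance b
    if va >= vb then va / vb, a.Length - 1, b.Length - 1
    else vb / va, b.Length - 1, a.Length - 1
```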
Is your feature request related to a problem? Please describe.
The FSharp.Stats.Signal namespace should implement a common interface to represent a peak. This will allow downstream algorithms to be implemented regardless of the peak detection algorithm used.
The coefficient of determination (CoD, R²) is a statistical measure of how well the regression approximates the real data points.
To avoid the excessive use of variables merely to reduce the resulting sum of squares (kitchen sink regression), the adjusted coefficient of determination was proposed.
The R²adj is defined as:
1-(1-R²)((n-1)/(n-p-1)) with n: number of observations, p: number of variables
It should be added to: Fitting.GoodnessOfFit
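The formula above translates directly (hypothetical helper name):

```fsharp
// adjusted coefficient of determination:
// R2adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
// n: number of observations, p: number of explanatory variables
let rSquaredAdjusted r2 (n: int) (p: int) =
    1. - (1. - r2) * (float n - 1.) / (float n - float p - 1.)
```

Unlike R², the adjusted value decreases when an added variable improves the fit less than expected by chance.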
Since the eigenvalue decomposition of matrices is now implemented in FSharp.Stats, there is no reason for RMT.compute to use Array2D as the type for input matrices.
It should therefore be changed to Matrix.
After generating a hierarchical cluster tree, there should be an easy way to prune the tree in order to receive groups of closely related items.
For this an aggregation function should be added.
The Continuous Wavelet Transform (CWT) is a multiresolution analysis method to gain insight into the frequency components of a signal with simultaneous temporal classification. Wavelet in this context stands for small wave and describes a window function which is convolved with the original signal at every position in time. Many wavelets exist; each of them is useful for a certain application, thereby 'searching' for specific patterns in the data. By increasing the dimensions (scale) of the wavelet function, different frequency patterns are studied.
In contrast to the Fourier transform, that gives a perfect frequency resolution but no time resolution, the CWT is capable of mediating between the two opposing properties of time resolution and frequency resolution (Heisenberg's uncertainty principle).
The data has to be padded at the start and end with artificial data points in order to perform wavelet transforms in these ranges. The padding values can be:
Sometimes an internal padding has to be performed. Especially if the data is continuous and not discrete, an artificial addition of data points is necessary to ensure admissibility. The padding values can be:
Is your feature request related to a problem? Please describe.
The implemented version of the silhouette index expects the raw data to be clustered by k-means clustering. A function is missing that takes an already clustered data set and returns the silhouette index.