fslaborg / fsharp.stats
Statistical testing, linear algebra, machine learning, fitting and signal processing in F#
Home Page: https://fslab.org/FSharp.Stats/
License: Other
In Fitting.NonlinearRegression, a model for an exponential function of the form y(x) = a * exp(b * x) is missing.
The linearization of an exponential function is often proposed, but underestimates the residuals at high y-values.
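For reference, a minimal sketch of what such a model entry would consist of: the model function and its partial derivatives, as needed by Gauss-Newton/Levenberg-Marquardt style optimizers (the names `expFunc` and `expGradient` are hypothetical, not the library's API):

```fsharp
// y(x) = a * exp(b * x); parameter vector p = [|a; b|]
let expFunc (p: float[]) x = p.[0] * exp (p.[1] * x)

// partial derivatives of the model with respect to the parameters:
// dy/da = exp(b * x);  dy/db = a * x * exp(b * x)
let expGradient (p: float[]) x =
    [| exp (p.[1] * x)
       p.[0] * x * exp (p.[1] * x) |]
```

Fitting in this nonlinear form avoids the residual distortion that the log-linearization introduces at high y-values.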
The documentation specifies that the package is available on NuGet, but currently it doesn't seem to be.
Accessing the provided NuGet link, one gets a 404 (not found).
Alternatively:
FSharp.Stats
Thanks!
Some functions in the Algebra module are callable from outside even though they should be encapsulated.
Example:
FSharp.Stats.Algebra.EVD.SymmetricEvd
Include p-value calculation for correlation coefficients. The easiest way is to use a t-test:
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Testing_using_Student's_t-distribution
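A sketch of the test statistic from the linked article (the helper name is hypothetical; the two-sided p-value would then come from a Student t distribution with n - 2 degrees of freedom, e.g. via the library's t distribution):

```fsharp
// t statistic for H0: rho = 0, given a sample correlation r and sample size n.
// Under H0, t follows a Student t distribution with n - 2 degrees of freedom.
let correlationTStatistic (r: float) (n: int) =
    let df = float n - 2.
    r * sqrt (df / (1. - r * r))
```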
Is your feature request related to a problem? Please describe.
Not really a problem. Currently the only implemented ordination method in FSharp.Stats is PCA. PCA is applied on data where a linear relationship is expected and the distance measure of choice is euclidean distance.
Describe the solution you'd like
Addition/implementation of the following Multidimensional Scaling algorithms:
Principal Coordinates Analysis (PCoA, Classical Multidimensional Scaling):
Non-metric multidimensional scaling (nMDS):
Describe the bug
When using the FSharp.Stats.Signal.PeakDetection.SecondDerivative.getPeaks function with polynomial order 1 in combination with the second derivative, it results in a "System.IndexOutOfRangeException: Index was outside the bounds of the array." error.
To Reproduce
Steps to reproduce the behavior:
I looked into the code, tried to narrow down the origin of the error, and found it to be caused by the
Signal.Filtering.savitzky_golay function used for calculating the second derivative and for data smoothing.
open FSharp.Stats
/// height = height of the gaussian peak, mean = mean of the gaussian distribution,
/// sigma = standard deviation of the gaussian distribution, x = position on the x-axis
let gaussianFunctionForY height mean sigma x = height * exp (-((x - mean) ** 2.) / (2. * (sigma ** 2.)))
let XX = [|0. .. 0.1 .. 100.|]
let gaussianYY = XX |> Array.map (fun x -> gaussianFunctionForY 50. 30. 0.3 x)
getPeaks 0. 1 5 XX gaussianYY // error seen above
let smoothedData =
    Signal.Filtering.savitzky_golay 5 1 0 1 gaussianYY
    |> Array.ofSeq // no error, works as intended
Signal.Filtering.savitzky_golay 5 1 2 1 smoothedData // error seen above
Signal.Filtering.savitzky_golay 5 1 2 1 gaussianYY // error seen above
Signal.Filtering.savitzky_golay 5 1 1 1 gaussianYY // no error
Signal.Filtering.savitzky_golay 5 2 2 1 gaussianYY // no error
Apparently, the savitzky_golay function fails at the following line, but I am not 100% sure that I read the debugger correctly.
let m = (Algebra.LinearAlgebraManaged.pseudoInvers b).Row(deriv) * ((float(rate)**float(deriv)) * SpecialFunctions.Factorial.factorial(deriv))
Expected behavior
It should smooth the data while fitting first-order polynomials for the second derivative.
Additional context
I am using the FSharp.Stats developer branch
Smoothing splines are used for nonparametric fitting. If many y values are associated with one x value, an automatic weighting can be applied.
When using Spline.smoothingSpline, identical x values are filtered out, although the method seems to be able to cope with duplicates.
Reference: https://robjhyndman.com/etc5410/splines.pdf
The knots (basispts) have to be distinct, so a filtering/warning step is necessary.
To reduce the impact of single data points on the regression, it is possible to assign weights to the data (e.g. based on replicate variance or signal intensity).
Caution: negative weights are not allowed.
It should be added to: Fitting.LinearRegression.OrdinaryLeastSquares.Polynomial
Curve and Surface Fitting, Academic Press, 1987 pp. 55
When using the hypergeometric distribution, large input numbers should be processed using exp (Binomial.coeffcientLn K k) rather than Binomial.coeffcient K k.
(hypergeometric 19714 29 1643).PDF 1
reports nan while
(hypergeometric 19714 29 1643).CDF 1. - (hypergeometric 19714 29 1643).CDF 0.
reports 0.21.
A change of
(SpecialFunctions.Binomial.coeffcient K k) * (SpecialFunctions.Binomial.coeffcient (N-K) (n-k)) / (SpecialFunctions.Binomial.coeffcient N n)
to
exp ((SpecialFunctions.Binomial.coeffcientLn K k) + (SpecialFunctions.Binomial.coeffcientLn (N-K) (n-k)) - SpecialFunctions.Binomial.coeffcientLn N n)
in Distributions.Discrete line 191 should fix this issue.
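A self-contained sketch of the log-space computation (helper names are hypothetical; a naive summed-logs binomial coefficient stands in here for SpecialFunctions.Binomial.coeffcientLn):

```fsharp
// ln C(n, k) via summed logs, assuming 0 <= k <= n; illustration only
let lnBinomial (n: int) (k: int) =
    List.sum [ for i in 1 .. k -> log (float (n - k + i)) - log (float i) ]

// hypergeometric PMF computed entirely in log space: stays finite for
// inputs where the direct product/quotient of coefficients overflows
let hypergeomPmf N K n k =
    exp (lnBinomial K k + lnBinomial (N - K) (n - k) - lnBinomial N n)
```

With the parameters from the report, this returns a finite probability instead of nan.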
Describe the bug
GoodnessOfFit.calculateSST
does not report the total sum of squares but the regression sum of squares.
The following procedure reports the correct value:
let sos = GoodnessOfFit.calculateSumOfSquares fitFunc x_Data y_Data
sos.Total
Describe the bug
I tried using FSharp.Stats.Distributions.Empirical.create 1. dataPoints
to bin my data points with a bandwidth of 1. Afterwards I wanted to sample from the distribution via Distributions.Empirical.random,
but it returned the following error every other time:
> System.Collections.Generic.KeyNotFoundException: An index satisfying the predicate was not found in the collection.
at Microsoft.FSharp.Collections.SeqModule.Find[T](FSharpFunc`2 predicate, IEnumerable`1 source)
at <StartupCode$FSI_0249>.$FSI_0249.main@() in D:\Freym\Source\Repos\Freymaurer\Jupyter_PraktikumBiotech\Scripts\JP06_Retention_time_and_scan_time.fsx:line 86
Stopped due to error
To Reproduce
Steps to reproduce the behavior:
let file = "Chlamy_JGI5_5(Cp_Mp).fasta" // can be found in BioFSharp.Mz
let sequences =
    file
    |> FastA.fromFile BioArray.ofAminoAcidString
    |> Seq.toArray
let digestedProteins =
    sequences
    |> Array.mapi (fun i fastAItem ->
        Digestion.BioArray.digest Digestion.Table.Trypsin i fastAItem.Sequence
        |> Digestion.BioArray.concernMissCleavages 0 0
    )
    |> Array.concat
let digestedPeptideMasses =
    digestedProteins
    |> Array.choose (fun peptide ->
        let mass = BioSeq.toMonoisotopicMassWith (BioItem.monoisoMass ModificationInfo.Table.H2O) peptide.PepSequence
        if mass < 3000. then Some mass else None
    )
let empHis = FSharp.Stats.Distributions.Empirical.create 1. digestedPeptideMasses
Expected behavior
It should produce a distribution from which one can randomly sample according to the probabilities.
Additional context
I already looked into the FSharp.Stats.Distributions.Empirical module and think that the Distributions.Empirical.random function is fine, but that the FSharp.Stats.Distributions.Empirical.create function is not correct when calculating the area subfunction. It seems unintuitive.
For my example above, the sum of all probabilities was 0.6921621451, so every time the randomly generated target subfunction of Distributions.Empirical.random produced a value above ~0.69, an error was returned.
Is your feature request related to a problem? Please describe.
The developer branch (5c1d95a) now contains a Levenberg-Marquardt variant that supports box constraints, as well as a classic implementation.
Describe the solution you'd like
Refactor the code to yield a single implementation that can be run constrained and unconstrained.
Distribution distance metrics are a good way to determine which distributions are more similar to each other based just on the absolute values.
Is your feature request related to a problem? Please describe.
During exercises it became clear that the functions in this module lack certain functionality as well as proper documentation.
Describe the solution you'd like
I will try to extend the current module with additional functions and documentation, while keeping the current functions as they are right now.
Describe alternatives you've considered
Alternatively, I could also just write some brief documentation.
When using Distributions.Continuous.normal, the normal distribution is initialized by mu (mean) and tau (standard deviation). It is more common to use sigma as the descriptor for the standard deviation.
Since tau is part of an alternative parameterization describing the reciprocal of the variance, it may be misleading to some users.
Additionally, the normal distribution is often initialized by N(mean, variance) rather than by mean and standard deviation, which increases the need for a precise description.
https://en.wikipedia.org/wiki/Normal_distribution#Alternative_parameterizations
Just interested - what is the aim of the project?
There is already a very good math library, Math.NET (although it is written in C#, there is an F# API).
Is your feature request related to a problem? Please describe.
Set comparison is not covered in the codebase at the moment.
Describe the solution you'd like
Addition of:
Description
add Matrix.countBy function
Interpolating polynomials are polynomials of degree n-1. This degree is sufficiently flexible to interpolate all present data points.
The construction of the linear system of equations is straightforward and does not require the expensive construction of least-squares linear systems. Additionally, the least-squares approach present in Fitting.LinearRegression.Polynomial
does not necessarily converge to an interpolating polynomial.
Try to interpolate the following data with polynomial regression:
let x_DataH = vector [0.;10.;30.;50.;70.;80.;82.]
let y_DataH = vector [150.;200.;200.;200.;180.;100.;0.]
An interpolating curve is expected, but the first knot at x = 0 is missed by the curve.
Proposed solution: construct a linear system of equations that has a unique solution and interpolates all given knots.
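To illustrate that interpolation needs no least-squares machinery, here is a sketch using the Lagrange form of the unique interpolating polynomial (the function name is hypothetical); by construction it passes through every knot, including x = 0:

```fsharp
// Lagrange form of the unique degree-(n-1) interpolating polynomial;
// mathematically equivalent to solving the Vandermonde system, but with
// no least-squares construction involved
let lagrangeInterpolate (xs: float[]) (ys: float[]) x =
    xs
    |> Array.mapi (fun i xi ->
        let basis =
            xs
            |> Array.mapi (fun j xj -> if i = j then 1. else (x - xj) / (xi - xj))
            |> Array.fold (*) 1.
        ys.[i] * basis)
    |> Array.sum
```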
The Array.shuffle function assigns the shuffled values to the original input array.
The input array must not be modified; it has to be copied before shuffling.
Module: FSharp.Stats.ML.Unsupervised.IterativeClustering
When working with kMeans clustering, the result type KClusteringResult contains a classifier to assign data points to the nearest centroid. The description states that the classifier returns the cluster index and the data point, but it actually returns the cluster index and the coordinates of the cluster centroid.
Since the data point is required for calling the classifier function anyway, it may be reasonable to return the centroid coordinates. If so, the description of the record type has to be changed to "Classifier function returns cluster index and centroid".
Is your feature request related to a problem? Please describe.
Many functions lack collection type consistency, e.g. an array as parameter but a vector as return type (for instance FSharp.Stats.Filtering.savitzky_golay).
Describe the solution you'd like
Refactoring of functions to avoid unnecessary collection conversions.
Pearson and similar correlation coefficients can be outlier sensitive. Biweight midcorrelation was shown to be more robust against outliers in some cases.
It might therefore be a good addition to the correlation module.
Is your feature request related to a problem? Please describe.
Post hoc tests are used to inspect a multiple-group comparison (e.g. ANOVA) for individual significant group mean differences.
While Tukey's HSD aims to perform all k(k-1)/2 comparisons, it is often desired to test all groups against a single control or standard.
Describe the solution you'd like
Dunnett's test allows this many-to-one comparison with higher power than Tukey's HSD.
Additional information/References
https://www.researchgate.net/post/What_is_the_difference_between_Tukeys_Post_Hoc_Test_and_Students_t-test
Testing.Outliers.tukey creates a dedicated Outliers type consisting of a lower and an upper border.
For convenience it would be beneficial to make use of the Interval type. Additionally, the outlier detection may fit better in Signal than in Testing.
SAM.S0.estimate returns values as percentages (values between 0. and 100.). It should instead return values as probabilities (values between 0. and 1.), as this is the standard behaviour in the repository.
Interpolating cubic splines are able to fit smooth curves through data points. For every interval in the data, a cubic polynomial with special boundary conditions is constructed. Thereby Runge's phenomenon is avoided to some extent.
It should be added to: Fitting.Spline.Interpolation
Cubic Spline Interpolation
Boundary conditions
Cubic spline online tool
Constrained Spline
Constrained monotone Spline
Is your feature request related to a problem? Please describe.
The LinearAlgebra module is a big mess: many functions are not implemented, the naming has to be unified, and unmanaged access to LAPACK functions is sparse.
Describe the solution you'd like
The subproject FSharp.Stats.LAPACK (already existing as a placeholder project) should contain the LAPACK stubs and function access. The naming in LinearAlgebra.fs has to follow a convention. Drop MKL support and embrace Netlib LAPACK, as it is open source. Unify the output types of the managed and unmanaged versions of the linear algebra functions.
Describe alternatives you've considered
Support MKL alongside Netlib LAPACK
To Reproduce
[|2.0; 3.0; 4.0; 5.0; 5.0; 5.0|]
|> Rank.rankAverage
returns
val it : float array = [|1.0; 2.0; 3.0; 2.666666667; 2.666666667; 2.666666667|]
Expected behavior
The result should be
val it : float array = [|1.0; 2.0; 3.0; 5.0; 5.0; 5.0|]
as the three fives should receive the ranks 4, 5 and 6, whose average is 5.
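A sketch of average ranking with the expected tie handling (a hypothetical standalone function, not the library code):

```fsharp
// average ranking: tied values share the mean of the 1-based ranks they
// occupy (here the three 5s occupy ranks 4, 5, 6 -> rank 5 each)
let rankAverage (xs: float[]) =
    let ranks = Array.zeroCreate xs.Length
    let sorted = xs |> Array.mapi (fun i x -> i, x) |> Array.sortBy snd
    let mutable i = 0
    while i < sorted.Length do
        // find the run of tied values starting at position i
        let mutable j = i
        while j < sorted.Length && snd sorted.[j] = snd sorted.[i] do
            j <- j + 1
        let meanRank = float (i + j + 1) / 2.   // mean of ranks i+1 .. j
        for k in i .. j - 1 do
            ranks.[fst sorted.[k]] <- meanRank
        i <- j
    ranks
```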
Some modules in FSharp.Stats.ML.Unsupervised are not part of the project. The outdated dependencies have to be updated for reintegration.
To model growth curves, additional models such as Gompertz should be added to the NonlinearRegression model table.
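For illustration, one common Gompertz parameterization could be sketched as follows (parameter names are assumptions; like any entry in the NonlinearRegression model table, it would additionally need its partial derivatives):

```fsharp
// Gompertz growth curve: y(t) = A * exp(-exp(-k * (t - tInfl)))
// A: upper asymptote, k: growth rate, tInfl: time of the inflection point
let gompertz a k tInfl t = a * exp (-(exp (-k * (t - tInfl))))
```

At t = tInfl the curve reaches A/e, and it approaches A for large t.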
Is your feature request related to a problem? Please describe.
It can become very complicated to derive e.g. an erfcx(x) function, so it would be nice to have an approximation.
Describe the solution you'd like
I want to implement a numerical differentiation function with a method to calculate an "optimal" window h over which the differentiation is done.
Describe alternatives you've considered
MATLAB can calculate the derivative of an erfcx(x) function, but only for given numbers, not for variables.
Additional context
Everything is set and ready, the PR is coming. 👍
If heteroscedasticity in the data is assumed, it would be beneficial if a weighted Pearson correlation could be calculated, in which points of high importance receive a higher impact.
Note: test whether switching the x- and y-data has an impact on the correlation value!
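A possible sketch of such a weighted Pearson coefficient (hypothetical helper; with equal weights it reduces to the ordinary coefficient, and the formula is symmetric in x and y):

```fsharp
// weighted Pearson correlation: weighted means, covariance and variances
let weightedPearson (w: float[]) (x: float[]) (y: float[]) =
    let wSum = Array.sum w
    let mx = (Array.map2 (*) w x |> Array.sum) / wSum
    let my = (Array.map2 (*) w y |> Array.sum) / wSum
    let cov  = Array.map3 (fun wi xi yi -> wi * (xi - mx) * (yi - my)) w x y |> Array.sum
    let varX = Array.map2 (fun wi xi -> wi * (xi - mx) ** 2.) w x |> Array.sum
    let varY = Array.map2 (fun wi yi -> wi * (yi - my) ** 2.) w y |> Array.sum
    cov / sqrt (varX * varY)
```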
Describe the bug
Shuffling rapidly in a loop yields the same arrays each time. There seems to be an issue with the random number generation in Random.rndgen.
To Reproduce
code snippet:
for i = 0 to 10 do
    printfn "%A" (Array.shuffleFisherYates [|0 .. 10|])
Output :
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
[|0; 5; 8; 4; 9; 6; 1; 3; 2; 7; 10|]
Real: 00:00:00.009, CPU: 00:00:00.000, GC gen0: 1, gen1: 0, gen2: 0
val it : unit = ()
Expected behavior
Should be (pseudo-)random, like this snippet:
for i = 0 to 10 do
    printfn "%A" (Array.init 11 (fun i -> [|0 .. 10|].[Random.rndgen.NextInt(10)]))
Output:
[|2; 5; 9; 3; 5; 2; 9; 4; 5; 9; 0|]
[|4; 9; 0; 1; 5; 0; 1; 4; 2; 6; 0|]
[|7; 8; 0; 5; 3; 7; 9; 7; 5; 2; 0|]
[|4; 7; 1; 4; 4; 7; 2; 1; 1; 6; 0|]
[|4; 0; 0; 9; 5; 8; 4; 1; 7; 5; 2|]
[|1; 8; 1; 3; 9; 9; 2; 4; 7; 3; 8|]
[|4; 4; 2; 9; 0; 5; 3; 2; 1; 4; 3|]
[|6; 4; 9; 5; 3; 1; 0; 3; 3; 4; 4|]
[|2; 6; 3; 3; 1; 7; 3; 8; 6; 9; 6|]
[|5; 6; 6; 7; 6; 8; 8; 7; 4; 4; 7|]
[|7; 1; 7; 3; 4; 5; 1; 3; 4; 0; 4|]
Real: 00:00:00.010, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val it : unit = ()
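A possible fix can be sketched as a Fisher-Yates shuffle that copies its input and reuses a single shared generator (standalone sketch; in the library the natural choice would be the shared Random.rndgen). Creating a fresh clock-seeded generator per call is exactly what makes rapid successive calls return identical arrays:

```fsharp
// one generator shared across all calls, created once
let rng = System.Random()

// Fisher-Yates shuffle on a copy: the input array stays unmodified
let shuffleCopy (arr: 'a[]) =
    let a = Array.copy arr
    for i = a.Length - 1 downto 1 do
        let j = rng.Next(i + 1)      // j uniform in 0 .. i
        let tmp = a.[i]
        a.[i] <- a.[j]
        a.[j] <- tmp
    a
```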
Kruskal-Wallis one-way ANOVA is a non-parametric method for comparing independent samples.
Describe the bug
Given the following conditions:
Population size (N): 100
Successes in population (K): 80
Sample size (n): 50
Number of successes in sample (k): 20
The probability density function at k = 20 should return 0., because it is impossible to obtain 30 failures (n-k) if there are only 20 failures in the population (N-K).
Nevertheless, the input is valid and the function should return a probability.
To Reproduce
(Distributions.Discrete.hypergeometric 100 80 50).PDF 20
Expected behavior
Should return 0.
Possible fix
In Discrete.hypergeometric.PDF, check whether (N-K) < (n-k) and return 0.
Is your feature request related to a problem? Please describe.
Confidence and prediction bands offer an interesting insight into regression outcomes. While the fitted function (correct model selection is assumed) shows the best estimate of the current model, it does not show the confidence.
While confidence and prediction bands are implemented for simple linear regression, a generalized version for linear and non-linear regression would be beneficial.
In ordinary least squares regression, the fitted straight regression line is heavily influenced by outliers. The Theil-Sen estimator is a non-parametric method to cope with data corrupted by outliers.
It takes the median of all slopes (and intercepts) between every pair of points in the data set as an estimator for the regression line (f(x) = mx + b).
Sen extended Theil's method by using only pairs of points with distinct x coordinates.
It should be added to: Fitting.LinearRegression.RobustRegression
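A standalone sketch of the estimator (function names are hypothetical): median slope over all point pairs with distinct x coordinates, then the intercept as the median of the residuals y_i - m * x_i:

```fsharp
let median (xs: float[]) =
    let s = Array.sort xs
    let n = s.Length
    if n % 2 = 1 then s.[n / 2] else (s.[n / 2 - 1] + s.[n / 2]) / 2.

// Theil-Sen estimator for f(x) = m * x + b; robust to outliers because
// the median of the pairwise slopes ignores extreme slope values
let theilSen (x: float[]) (y: float[]) =
    let slopes =
        [| for i in 0 .. x.Length - 2 do
             for j in i + 1 .. x.Length - 1 do
               if x.[i] <> x.[j] then
                 yield (y.[j] - y.[i]) / (x.[j] - x.[i]) |]
    let m = median slopes
    let b = Array.map2 (fun xi yi -> yi - m * xi) x y |> median
    m, b
```

Note how a single corrupted point leaves the estimate unchanged, where OLS would tilt the whole line.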
FSharp.Stats.Distributions.Empirical.random makes wrong use of a random number generator, e.g.:
[
    for i = 1 to 100 do
        yield random exampleDistribution
]
This should yield a sample of numbers present in the example distribution.
Instead, the loop above yields the same number 100 times, because the random number generator is created each time the function is called, but always with the same seed (system time, I believe).
Possible fix: use the random number logic of FSharp.Stats.
Is your feature request related to a problem? Please describe.
The calculation of confidence intervals is missing.
For polynomial fitting, leave-one-out cross validation (LOOCV) is a simple model selection method. It should be added to: Fitting.GoodnessOfFit.OrdinaryLeastSquares.Polynomial
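The scheme can be sketched generically (names are hypothetical): given a fit function that trains on the remaining n-1 points and a predict function for the held-out point, LOOCV returns the mean squared prediction error; model selection then picks the polynomial degree minimizing it:

```fsharp
// generic leave-one-out cross validation: for each i, train on all points
// except (xs.[i], ys.[i]) and score the prediction at the held-out x
let loocv (fit: float[] -> float[] -> 'model)
          (predict: 'model -> float -> float)
          (xs: float[]) (ys: float[]) =
    [| for i in 0 .. xs.Length - 1 do
         let xTrain = xs |> Array.indexed |> Array.filter (fun (j, _) -> j <> i) |> Array.map snd
         let yTrain = ys |> Array.indexed |> Array.filter (fun (j, _) -> j <> i) |> Array.map snd
         let err = predict (fit xTrain yTrain) xs.[i] - ys.[i]
         yield err * err |]
    |> Array.average
```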
When using int collections as input sequences, the output can be wrong or NaN due to int overflow.
let x1 = Array.init 1000 id |> FSharp.Stats.Array.shuffleFisherYates
let x2 = Array.init 1000 id |> FSharp.Stats.Array.shuffleFisherYates
let y1 = x1 |> Array.map float
let y2 = x2 |> Array.map float
FSharp.Stats.Correlation.Seq.pearson y1 y2
FSharp.Stats.Correlation.Seq.pearson x1 x2 // gives a different output than above because of int overflow
let a1 = Array.init 10000 id |> FSharp.Stats.Array.shuffleFisherYates
let a2 = Array.init 10000 id |> FSharp.Stats.Array.shuffleFisherYates
let b1 = a1 |> Array.map float
let b2 = a2 |> Array.map float
FSharp.Stats.Correlation.Seq.pearson b1 b2
FSharp.Stats.Correlation.Seq.pearson a1 a2 // gives a NaN
Convert the input sequences from int to float; the internal float conversion should be done at an earlier point.
For piping convenience, the matrix/vector input should be the last parameter in functions from the Matrix and Vector modules.
Example:
Matrix.getCols (m: Matrix<float>) (i: int)
Vector.get (v: Vector) (i: int)
should be
Matrix.getCols (i: int) (m: Matrix<float>)
Vector.get (i: int) (v: Vector)
add unit tests for Vector module
add unit tests for Matrix module
Before clean up and pushing implement unit tests
Compare parameter order with Core functions
If functions are renamed or the parameter order is switched, keep the old naming with an [<Obsolete("Do not use. Use [function name] instead.")>] tag.
Parameter renaming can be performed without further attention.
Commonly, the F test is used to check samples for homoscedasticity.
While the distribution and the test statistic of the test itself are implemented, an FTest module that tests two samples for homoscedasticity is missing. Additional documentation should cover theory and applications of the one- and two-tailed F test.
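Such a module would essentially compute the variance ratio; a standalone sketch (hypothetical names, with the larger variance conventionally placed in the numerator):

```fsharp
// unbiased sample variance (n - 1 denominator)
let sampleVariance (xs: float[]) =
    let m = Array.average xs
    (xs |> Array.sumBy (fun x -> (x - m) ** 2.)) / float (xs.Length - 1)

// F statistic for testing equal variances; the p-value would then come
// from an F distribution with the returned degrees of freedom
let fStatistic (a: float[]) (b: float[]) =
    let va, vb = sampleVariance a, sampleVariance b
    if va >= vb then va / vb, a.Length - 1, b.Length - 1
    else vb / va, b.Length - 1, a.Length - 1
```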
Is your feature request related to a problem? Please describe.
The FSharp.Stats.Signal namespace should implement a common interface to represent a peak. This will allow downstream algorithms to be implemented regardless of the peak detection algorithm used.
The coefficient of determination (CoD, R²) is a statistical measure of how well the regression approximates the real data points.
To avoid the excessive use of variables merely to reduce the resulting sum of squares (kitchen sink regression), the adjusted coefficient of determination was proposed.
The R²adj is defined as:
1-(1-R²)((n-1)/(n-p-1)) with n: number of observations, p: number of variables
It should be added to: Fitting.GoodnessOfFit
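The formula above translates directly (hypothetical helper name):

```fsharp
// adjusted coefficient of determination:
// R2adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
// n: number of observations, p: number of explanatory variables
let rSquaredAdjusted r2 (n: int) (p: int) =
    1. - (1. - r2) * (float n - 1.) / (float n - float p - 1.)
```

Unlike R², the adjusted value decreases when an added variable improves the fit less than expected by chance.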
Since the eigenvalue decomposition of matrices is now implemented in FSharp.Stats, there is no reason for RMT.compute to use Array2D as the type for input matrices.
It should therefore be changed to Matrix.
After generating a hierarchical cluster tree, there should be an easy way to prune the tree in order to receive groups of closely related items.
For this an aggregation function should be added.
The Continuous Wavelet Transform (CWT) is a multiresolution analysis method to gain insight into the frequency components of a signal with simultaneous temporal classification. Wavelet in this context stands for small wave and describes a window function which is convolved with the original signal at every position in time. Many wavelets exist; each of them is useful for a certain application, thereby 'searching' for specific patterns in the data. By increasing the dimensions (scale) of the wavelet function, different frequency patterns are studied.
In contrast to the Fourier transform, that gives a perfect frequency resolution but no time resolution, the CWT is capable of mediating between the two opposing properties of time resolution and frequency resolution (Heisenberg's uncertainty principle).
The data has to be padded at the start and end with artificial data points in order to perform wavelet transforms in these ranges. The padding values can be:
Sometimes an internal padding has to be performed. Especially if the data is continuous and not discrete, an artificial addition of data points is necessary to ensure admissibility. The padding values can be:
Is your feature request related to a problem? Please describe.
The implemented version of the silhouette index expects the raw data to be clustered by k-means clustering. A function is missing that takes an already clustered data set and returns the silhouette index.