Giter Site home page Giter Site logo

data2semantics / powerlaws Goto Github PK

View Code? Open in Web Editor NEW
21.0 3.0 9.0 30 KB

Java library for the analysis of power law distributed data. Scroll down for README.

Home Page: http://peterbloem.nl

License: MIT License

Java 100.00%
power-laws clauset estimate statistics data-science

powerlaws's Introduction

Power laws

This is a small java library for the analysis of power law distributed data. The methods implemented are all taken form the paper Power-Law Distributions in Empirical Data by Clauset, Shalizi and Newman (2007) and its reference implementations available at http://tuvalu.santafe.edu/~aaronc/powerlaws/

It contains facilities for estimating parameters, uncertainty and significance and for generating power law distributed data. All methods are implemented for continuous data, discrete data and the approximation of discrete data with a continuous distribution.

Installation

You can use jitpack.io to add the powerlaws library to your project. Simply add the following repository to your pom.xml:

	<repositories>
		<repository>
		    <id>jitpack.io</id>
		    <url>https://jitpack.io</url>
		</repository>
	</repositories>

And the following dependency

	<dependency>
	    <groupId>com.github.Data2Semantics</groupId>
	    <artifactId>powerlaws</artifactId>
	    <version>v0.1.0</version>
	</dependency>

Check the releases for the version number of the latest release.

The Basics

The main thing you're likely to want to do is estimate the exponent for some data. Say you have your data as a Collection of double values called 'data'. You can estimate the exponent as follows:

double exponent = Continuous.fit(data).fit().exponent();

The second fit() method above actually returns a representation of a whole continuous power law distribution, which you can use to do various things:

Continuous distribution = Continuous.fit(data).fit();

// * Retrieve the distribution's parameters
double exponent = distribution.exponent();
double xMin = distribution.xMin();

// * Generate some synthetic data
List<Double> generated = distribution.generate(1000);

Of course, if you have a specific distribution in mind, you can create the Continuous object directly:

// * Create a continuous power law distribution with xMin 340 and exponent 2.5 
Continuous distribution = new Continuous(340, 2.5);

If your data is discrete (ie. integers) you can use the class Discrete in an analogous way. Since the methods in Discrete are a bit slower than those in Continuous, you can also use DiscreteApproximate, which approximates a discrete power law with a continuous one. According to Clauset, this use is justified if xMin is larger than 6. For xMin > 1000 you can safely use the approximate version.

#The Details

Uncertainty

The uncertainty measures, in a sense, the robustness of our estimates. If a small change in the data leads to a massively different value, then we cannot rely on it, or we must gather more data.

We estimate the uncertainty by bootstrapping. We repeat the experiment a large number of times (1000 to 10000) with data sets that are sampled with replacement from the original data set. For each we record the measured exponent, the xMin parameter and the number of points in the tail. In each case the sample standard deviation of the recorded values is the uncertainty.

Varying xMin

The method of fitting described above uses a maximum likelihood estimator to estimate the exponent for a given xMin parameter. It then chooses as xMin the data point for which the resulting distribution has the smallest KS statistic to the data. If you want to test different values of xMin, you can use the Fit object which represents the intermediate stage of fitting a power law to data:

// * Create a Fit object
Fit<Double, PowerLaw<Double>> fit = Continuous.fit(data);

// * Print out the distances for all of the data points
for(double datum : data)
	System.out.println(datum + ": " + fit.fit(datum)); 

Significance

Any data can have a power law fit to it. To ascertain whether the power law is a proper model, we must calculate the significance. To do this, we generate a large number of synthetic data sets like our original data set and fit a model to it. We then calculate the KS statistic between the generated data and its model. The ratio of generated data sets with a distance higher than that of our fit is the probability of seeing a fit as 'bad' as ours given that the data comes from a power law distribution.

Continuous model = Continuous.fit(data).fit();

// * Calculate the significance based on 100 trials
double significance = model.significance(data, 1000);
// * Calculate the significance to an accuracy of 0.01
double significance = model.significance(data, 0.01);

The calculation of the significance is by far the most expensive process. For a large data set, the process an easily take hours. Note that this is a significance test for the the power law hypothesis (rather than the null hypothesis) so that high values mean that the power law is a good fit. Clauset et al. suggest that for values below 0.01 the power law hypothesis should be rejected.

Randomness

All random numbers come from PowerLaws.random. By default this Random has a fixed seed so that runs can be deterministically repeated. If this behaviour is not desired, you can give the Random a random seed:

PowerLaws.random = new Random();

The KS Test

The Kolmogorov-Smirnov test is used to estimate the xMin parameter. There is a small difference between the reference implementation of Clauset 2007 and the official definition of the KS test. The change does not cause a great difference in measured values (within uncertainties) but it is important when replicating the results in Clauset 2007.

The constant PowerLaws.KS_CORRECT determines which version of the KS test is used. If true, the theoretically correct version is used, if false, the version based on the reference implementation is used.

Contact

For bug fixes and suggestions, GitHub is preferred (send pbloem a message or create a ticket). If you don't have a GitHub account you can email to p & peterbloem & nl. (replacing the ampersands with an at symbol and a dot respectively).

License

This library is released under the MIT license. You are free to create derivative works, including for commercial purposes, and to re-release under another license. Attribution is not required, but it is always appreciated. See the file LICENSE.txt for the full text of the license.

githalytics.com alpha

powerlaws's People

Contributors

dependabot[bot] avatar pbloem avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

powerlaws's Issues

How to add Data2Semantics/powerlaws in the pom.xml?

Hi ,there, I want user powerlaws to caculate the network metric, I found that when I add it in the pom.xml file it went wrong

 <dependency>
        <groupId>data2semantics</groupId>
        <artifactId>powerlaws</artifactId>
        <version>0.0.1-SNAPSHOT</version>
 </dependency> 

have not published?

Off-by-one error in Continuous.p?

According to the paper of Clauset et al. (both the 2007 and the 2009 version), the PDF of a continuous power-law distribution is defined as (alpha-1)/x_min * (x/x_min)^(-alpha). The current implementation (line 43 of Continuous.java) does not match that definition:
return (exponent() / xMin()) * Math.pow(x / xMin(), -exponent());
Should it be changed to:
return ((exponent() - 1.0) / xMin()) * Math.pow(x / xMin(), -exponent());
?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.