milcent / benford_py

Python implementation of Benford's Law tests.

License: BSD 3-Clause "New" or "Revised" License

Python 13.85% Jupyter Notebook 86.15%

python python3 benford benfords-law benford-compliant digit simon-newcomb auditing financial-analysis compliance

benford_py's Introduction

Benford for Python

Citing

If you find Benford_py useful in your research, please consider adding the following citation:

@misc{benford_py,
      author = {Marcel, Milcent},
      title = {{Benford_py: a Python Implementation of Benford's Law Tests}},
      year = {2017},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/milcent/benford_py}},
}

current version = 0.5.0

See release notes for features in this and in older versions

Python versions >= 3.6

Installation

Benford_py is a package on PyPI, so you can install it with pip:

pip install benford_py

or

pip install benford-py

Or you can cd into the site-packages subfolder of your python distribution (or environment) and git clone from there:

git clone https://github.com/milcent/benford_py

For a quick start, please go to the Demo notebook, in which I show examples of how to run the tests with the SPY (S&P 500 ETF) daily returns.

For more fine-grained details of the functions and classes, see the docs.

Background

The first digit of a number is its leftmost digit.

Since the first digit of any number can range from "1" to "9" (not considering "0"), it would be intuitively expected that the proportion of each occurrence in a set of numerical records would be uniformly distributed at 1/9, i.e., approximately 0.1111, or 11.11%.

Benford's Law, also known as the Law of First Digits or the Phenomenon of Significant Digits, is the finding that the first digits of the numbers found in series of records from the most varied sources do not display a uniform distribution. Instead, the digit "1" is the most frequent, followed by "2", "3", and so on, with frequencies decreasing successively down to "9", which has the lowest frequency as the first digit.

The expected proportions of the first digits in a Benford-compliant data set are approximately: 1: 30.1%, 2: 17.6%, 3: 12.5%, 4: 9.7%, 5: 7.9%, 6: 6.7%, 7: 5.8%, 8: 5.1%, 9: 4.6%.

The first record on the subject dates from 1881, in the work of Simon Newcomb, an American-Canadian astronomer and mathematician, who noted that in the logarithmic tables the first pages, which contained logarithms beginning with the numerals "1" and "2", were more worn out, that is, more consulted.

Simon Newcomb, 1835-1909.

In that same article, Newcomb proposed the formula for the probability of a certain digit "d" being the first digit of a number, given by the following equation:

P(D = d) = log10(1 + 1/d)

where P(D = d) is the probability that the first digit is equal to d, and d is an integer ranging from 1 to 9.
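The first-digit formula translates directly into code. The following is a minimal sketch using only the standard library; the function name `first_digit_probs` is made up for illustration and is not part of benford_py.

```python
import math

def first_digit_probs():
    """Expected Benford proportions for the first digits 1..9.

    Hypothetical helper for illustration; it evaluates log10(1 + 1/d).
    """
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = first_digit_probs()
for digit, prob in probs.items():
    print(f"{digit}: {prob:.4f}")
```

The nine proportions sum to 1, since the terms log10((d + 1) / d) telescope to log10(10).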

In 1938, the American physicist Frank Benford revisited the phenomenon, which he called the "Law of Anomalous Numbers," in a survey with more than 20,000 observations of empirical data compiled from various sources, ranging from areas of rivers to molecular weights of chemical compounds, including cost data, address numbers, population sizes and physical constants. All of them, to a greater or lesser extent, followed such distribution.

Frank Albert Benford, Jr., 1883-1948.

The extent of Benford's work seems to have been one good reason for the phenomenon to be popularized with his name, though described by Newcomb 57 years earlier.

Derivations of the original formula were also applied in the expected findings of the proportions of digits in other positions in the number, as in the case of the second digit (BENFORD, 1938), as well as combinations, such as the first two digits of a number (NIGRINI, 2012, p.5).
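As a rough illustration of those derivations, the sketch below computes the expected proportions for the first two digits (10 to 99) and, by summing over every possible first digit, for the second digit (0 to 9). The function names are hypothetical and the code is independent of how benford_py builds its own expected tables.

```python
import math

def first_two_digits_probs():
    """Expected proportions for the first two digits, 10..99."""
    return {n: math.log10(1 + 1 / n) for n in range(10, 100)}

def second_digit_probs():
    """Expected proportions for the second digit, 0..9.

    Each second digit d2 collects the first-two-digits probabilities
    of 1d2, 2d2, ..., 9d2.
    """
    f2d = first_two_digits_probs()
    return {
        d2: sum(f2d[10 * d1 + d2] for d1 in range(1, 10))
        for d2 in range(10)
    }

f2d_probs = first_two_digits_probs()
sd_probs = second_digit_probs()
```

The second-digit distribution is much flatter than the first-digit one, which is one reason tests on later digits need larger samples to detect deviations.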

Only in 1995, however, was the phenomenon proven by Hill. His proof was based on the fact that numbers in data series following Benford's Law are, in effect, "second generation" distributions, i.e., combinations of other distributions. The union of randomly drawn samples from various distributions forms a distribution that respects Benford's Law (HILL, 1995).

When sorted in ascending order, data that obey Benford's Law must approximate a geometric sequence (NIGRINI, 2012, p.21). From this it follows that the logarithms of this ordered series must form a straight line. In addition, the mantissas (decimal parts) of the logarithms of these numbers must be uniformly distributed in the interval [0, 1) (NIGRINI, 2012, p.10).
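A small experiment can make the link between geometric sequences, uniform mantissas and first digits concrete. This is a sketch under simple assumptions (a synthetic sequence spanning exactly four orders of magnitude); the names `mantissa`, `sequence` and `share_of_ones` are illustrative only.

```python
import math

def mantissa(value):
    """Fractional part of log10(value), for value > 0."""
    return math.log10(value) % 1

n = 1000
# Geometric sequence from 1 up to (but not including) 10**4.
sequence = [10 ** (4 * i / n) for i in range(n)]
mantissas = sorted(mantissa(v) for v in sequence)

# A value has first digit 1 exactly when its mantissa is below log10(2).
share_of_ones = sum(1 for m in mantissas if m < math.log10(2)) / n
print(round(share_of_ones, 3), round(math.log10(2), 3))
```

Because the mantissas are spread evenly over [0, 1), the share of values starting with "1" comes out close to log10(2), the Benford proportion for that digit.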

In general, a series of numerical records follows Benford's Law when (NIGRINI, 2012, p.21):

it represents magnitudes of facts or events, such as populations of cities, flows of water in rivers or sizes of celestial bodies;
it does not have pre-established minimum or maximum limits;
it is not made up of numbers used as identifiers, such as identity or social security numbers, bank accounts, telephone numbers; and
its mean is greater than the median (positive skew), and the data is not concentrated around the mean.

It follows from this expected distribution that, if the set of numbers in a series of records that usually respects the Law shows a deviation in the proportions found, there may be distortions, whether intentional or not.

Benford's Law has been used in several fields. After ascertaining that the usual data type is Benford-compliant, one can study samples from the same data type in search of inconsistencies, errors or even fraud.

This open source module is an attempt to facilitate the performance of Benford's Law-related tests by people using Python, whether interactively or in an automated, scripting way.

It uses the versatility of numpy and pandas, along with matplotlib for visualization, to deliver results and plots, and much more.

It has been a long time since I last tested it in Python 2. The death clock has stopped ticking, so officially it is for Python 3 now. It should work on Linux, Windows and Mac, but please file a bug report if you run into some trouble.

Also, if you have some nice data set that we can run these tests on, let's try it.

Thanks!

Milcent

benford_py's People

jfrfonseca leonardoalcantara justine0731 kdunn926 numaflores h2oai gilbertobotaro armandolicurgo accardoso kathreftisai jinhuli mirekphd gurumaia platikanova mrkjhsn m3ssilva im-alexandre rahulissar uditdeshpande dbpgz guglielmosanchini vigsterkr gfechio vicnoo danilapilin javiergarciafronti easyrider mingdazheng pbanikk cybniv vicmochengo pgbhat geremia glynne-dewar pablo-lamtenzan enbeghan disha-l cassandra-3 trendingtechnology troyam rlorenzo93 theskallywag edologgerbird standardgalactic servidorescloud sharmarahul20 gmineo hyandell zekelhealthcare marcejav dfgomezc

benford_py's Issues

Confidence @property

Decorate the confidence attribute with the @property decorator to make it easy to update it.

Divide by zero warning from np.log in Bhattacharyya Distance

Similar to #52.

Z score raises ZeroDivisionError when N = 0.

Hello. I hope this is the correct way to report issues I have run into using your program.

When I try using Benford on a dataset containing only single-digit integers, it raises a ZeroDivisionError when calculating F2D/F3D/SD/L2D, simply because the Base() class, which transforms the data, places -1 for all values, thus giving N = 0.

To bypass this, I have simply made my test data contain values > 1000. However, I can't manipulate production data. As of today, there are no issues, as every data set I test Benford() on luckily contains at least one data point with a value > 1000.
I don't like having to specifically try/except each time I use Benford, simply because it calculates for all digits, i.e., F1D, F2D, F3D, SD, L2D.

I think there should be an input to Benford() stating which digits to test against?

def Z_score(frame, N):
    """Computes the Z statistics for the proportions studied

    Args:
        frame: DataFrame with the expected proportions and the already calculated
            Absolute Differences between the found and expected proportions
        N: sample size

    Returns:
        Series of computed Z scores
    """
    return (frame.AbsDif - (1 / (2 * N))) / sqrt(
        (frame.Expected * (1. - frame.Expected)) / N)

E ZeroDivisionError: division by zero

Hope this makes sense. There is nothing wrong with the Z_score function. The issue is how Benford() tries to test against, e.g., F3D even if there is no three-digit number in the data set.
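One way to avoid the exception, sketched here for a single proportion in plain Python rather than for the DataFrame version used by the library, is to return early when the sample is empty. The name `z_score_safe` and the choice of returning None are assumptions for illustration, not the fix adopted in benford_py.

```python
import math

def z_score_safe(abs_dif, expected, n):
    """Z statistic for one proportion, with continuity correction.

    abs_dif:  absolute difference between found and expected proportions
    expected: expected (Benford) proportion, strictly between 0 and 1
    n:        sample size

    Returns None when the sample is empty, instead of dividing by zero.
    """
    if n <= 0:
        return None
    correction = 1 / (2 * n)
    std_error = math.sqrt(expected * (1 - expected) / n)
    return (abs_dif - correction) / std_error

z_empty = z_score_safe(0.05, 0.30103, 0)
z_value = z_score_safe(0.05, 0.30103, 1000)
```

A caller can then skip or flag a test whose result is None. For reference, a Z score above about 1.96 marks a deviation that is significant at the 95% confidence level.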

Fix integer data input error

Correct dtypes to check to int64
Adapt ValueError message
Update version to 0.1.0.4

General Documentation

High Z score and its interpretation

data-0.txt

I am new to Benford's Law.
I tested whether the third column of data-0.txt obeys Benford's Law and I got very large Z scores, like 26, 86 and 226.
Is this normal, or am I doing something wrong?
If it is normal, does it mean the column does not obey Benford's Law?

Any help would be greatly appreciated.

Break benford.py into several files

integral counts in 10s and 100s cause issues in the 2-digit benford graph

So, if I push numbers with distributions in the 1's and 10's (nothing really less than 100), then the two digit benford registers the 1.0, 2.0, 3.0 etc. as instances of the 1-0, 2-0, (and so on) digit pattern.

So, the bottom line issue is that for integral distributions involving some samples which are less than 10, the 2-digit benford will fail, and will yield a graph that can be a major false alarm.

The function used was:

benford.first_digits
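The behaviour described above can be avoided by discarding values that do not have two digits before running a first-two-digits test. The sketch below handles integer data only; `first_two_digits_or_none` is a hypothetical helper and does not reflect how benford.first_digits prepares its input.

```python
def first_two_digits_or_none(value):
    """First two digits of an integer as an int in 10..99.

    Returns None for values with fewer than two digits, so that 7 is
    not turned into 70 and counted under the wrong digit pair.
    """
    n = abs(int(value))
    if n < 10:
        return None
    return int(str(n)[:2])

counts = [3, 7, 12, 45, 120, 9, 1987]
pairs = [first_two_digits_or_none(v) for v in counts]
usable = [p for p in pairs if p is not None]
print(usable)
```

Reporting how many records were discarded alongside the result helps the reader judge whether the remaining sample is still large enough for the test.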

Refactor for pandas roadmap

Method chaining
Dropping inplace=True

Critical KS val NoneType division

The Kolmogorov-Smirnov critical value is computed in real time when calling the Test class property critical_values.
This causes it to break when the confidence is None.
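A guard of the following kind would avoid the failure. This is only a sketch: the function name, the lookup table and the return of None are assumptions, and the coefficients are the standard large-sample approximations for the Kolmogorov-Smirnov critical value, not necessarily the values used inside benford_py.

```python
import math

# Large-sample KS coefficients: critical value is about c / sqrt(n).
KS_COEFFICIENTS = {90: 1.22, 95: 1.36, 99: 1.63}

def ks_critical_value(confidence, n):
    """Approximate KS critical value, or None if it cannot be computed.

    confidence: 90, 95, 99, or None (meaning no critical value wanted)
    n:          sample size
    """
    if confidence is None or n <= 0:
        return None
    return KS_COEFFICIENTS[confidence] / math.sqrt(n)

crit_none = ks_critical_value(None, 500)
crit_95 = ks_critical_value(95, 100)
```

Confidence levels missing from the table raise a KeyError in this sketch, which keeps unsupported levels from passing silently.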

Benford class tests as dict items AND @properties callable

Tests as dict items (remove Boolean flags like has_summation etc)

Improve setup.py

summation() unexpected kwarg 'verbose'

Hello,
I get this error with the summation() which resembles the closed issue a fellow member had with mad(). Unfortunately, I can't find a way around it.

--Code--

Initialized sequence with 1452 registries.

TypeError Traceback (most recent call last)

in ()
1 #Summation
----> 2 bf.summation(rev['Revenues'],digs=2,show_plot=True)

/usr/local/lib/python3.6/dist-packages/benford/benford.py in summation(data, digs, decimals, sign, top, verbose, show_plot, save_plot, save_plot_kwargs, inform)
1554 show_plot=show_plot, save_plot=save_plot,
1555 save_plot_kwargs=save_plot_kwargs, ret_df=True)
-> 1556 if verbose == True:
1557 return data.sort_values('AbsDif', ascending=False)
1558 else:

TypeError: summation() got an unexpected keyword argument 'verbose'

The code makes comparisons to None using ==

Comparing to None with == is not recommended (see PEP 8): None is a singleton, and == can be overridden by a class's __eq__ method, so the result may be misleading.
Use if K is None: instead.
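A short example shows why the identity test is the safer one. The class `AlwaysEqual` is invented purely for the demonstration.

```python
class AlwaysEqual:
    """Toy class whose equality test always succeeds."""

    def __eq__(self, other):
        return True

k = AlwaysEqual()

equals_none = (k == None)  # noqa: E711 -- uses the overridden __eq__
is_none = (k is None)      # identity check, cannot be overridden

print(equals_none, is_none)
```

The equality comparison reports a match even though `k` is clearly not None, while `is None` gives the expected answer.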

What does “discarded 20 records <1 after preparation” mean?

@milcent Could you please tell me what “discarded 20 records <1 after preparation” means? Do I need to fix something?

Divide by zero warning from np.where in Kullback-Leibler Divergence

Benford_py last 2 digits

I have performed every other Benford test, but this one requires decimal values:

Last Two Digits Test

L2d = bf.last_two_digits(orders["Order_amount"].astype(int), decimals=1, confidence=99)

Warning

C:\Users\cgarciadiaz\AppData\Local\Continuum\anaconda3\envs\Supplier_DTEC\lib\site-packages\benford\benford.py:471: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
temp['L2D'] = temp.ZN % 100

The code makes redundant equality comparisons (K == True)

Comparing to True or False with == is not recommended (see PEP 8): it is redundant, and it gives unexpected results for truthy values that are not equal to True.

Use if K: or if not K: instead.
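The difference between an equality test against True and a plain truth test is easy to demonstrate. The variable names below are illustrative only.

```python
value = 2  # truthy, but not equal to True

equals_true = (value == True)  # noqa: E712 -- compares 2 with 1
truthy = bool(value)           # what "if value:" actually evaluates

def describe(flag):
    """Idiomatic form: rely on truthiness instead of == True."""
    return "on" if flag else "off"

print(equals_true, truthy, describe(value))
```

For genuine booleans both forms agree, so switching to `if K:` does not change behaviour for flags such as verbose or show_plot.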

Add Bootstrap Regression Procedure as a Test method

Here you can find an explanation about why this method is better than using Z-Score, Chi-Square or MAD
https://opensiuc.lib.siu.edu/cgi/viewcontent.cgi?article=1032&context=epse_pubs

https://opensiuc.lib.siu.edu/cgi/viewcontent.cgi?article=1033&context=epse_pubs

Module's name and changes on pypi

I think changing the name from "benford_py" to simply "benford" would be more intuitive, like "pandas", "numpy", etc.

I reinstalled the module in order to include the arc_test function, but it's not updated on PyPI.

mad() got an unexpected keyword argument 'verbose'

Just trying to grab the MAD value for my dataset, but I ran into this error. [Recently updated to 0.2.5]
The code is as follows:
import benford as bf
mad1 = bf.mad(rawdata["Gross Total"], test=1)

Traceback (most recent call last):

File "", line 1, in
mad1 = bf.mad(rawdata["Gross Total"], test=1)

File "C:\Users\sebastianlimzj\AppData\Local\Continuum\anaconda3\lib\site-packages\benford\benford.py", line 1362, in mad
start.first_digits(digs=test, MAD=True, MSE=True, simple=True)

File "C:\Users\sebastianlimzj\AppData\Local\Continuum\anaconda3\lib\site-packages\benford\benford.py", line 639, in first_digits
self.MAD = mad(df, test=digs, verbose=self.verbose)

TypeError: mad() got an unexpected keyword argument 'verbose'

Bug: TypeError when calling 'chi_square' argument

Bug

"TypeError: 'bool' object is not callable" when calling the 'chi_square' argument:
Bug-TypeError of chi_square.ipynb.zip

Platform

MacOS Catalina 10.15.1
Python 3.7.1
benford-py 0.2.6

Publish v0.2.0 to PyPI

Summation Test output

Hello!
First of all, you did a very good job writing this library. It saves a lot of time for all of us Benford researchers without proper programming education.
I would like to bring to your attention a possible malfunction of the Summation Test function.
When I run it for a multitude of columns in a for loop, it plots the histograms at the end, not together with the respective tables produced from each separate column. This makes it hard to match each histogram with the column/variable it refers to.
Furthermore, the produced tables are ordered by the F2D column and not the difference column as implied by the title of the table.
I wish you the best!

Edit: Could you please provide some insight about what does the AbsDif represent and how it could be used?

Readme - 0.2.0 features

Selection of tests to be subjected to Zscore (and other stats that use sample size) in Benford object

Currently, the Benford object, after internalizing the data, performs all first order tests (F1D, F2D, F3D, SD and L2D) by default.
Then one can use other methods so that it also includes the Summation (F1D_Sum, F2D_Sum and F3D_Sum), the Mantissas and the Second Order (F1D_Sec, F2D_Sec...) tests.
However, when the data do not span multiple orders of magnitude, some digit combinations may have no hits, especially the First-Two-Digits (45, 77...) and First-Three-Digits (122, 670...), which will cause a ZeroDivisionError when computing the Z scores, since the number of hits is used as the denominator, as well noticed by @ditlevjoergensen in #40.
Ideally, then, when initializing, there should be some safeguard preventing the Z scores from even being computed on that Test object if there are no hits, with the user informed of that if the verbose flag is on.

No such file or directory: '/wrkdirs/usr/ports/math/py-benford_py/work-py37/benford_py-0.3.2/README-pypi.md'

The build of the tarball 0.3.2 from the PyPI page fails:

===>  Configuring for py37-benford_py-0.3.2
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "setup.py", line 6, in <module>
    with open(path.join(this_directory, 'README-pypi.md'), encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/wrkdirs/usr/ports/math/py-benford_py/work-py37/benford_py-0.3.2/README-pypi.md'
*** Error code 1

Reorganize the code as a Python Package

Add interactive viz

Add sample-independent conformity tests - Bhattacharyya distance and Kullback–Leibler divergence

This issue was born from a discussion with Rosa María Maza Quiroga, a Ph.D. student at the University of Malaga. She asked for the exact p-values when executing the Kolmogorov-Smirnov test, which are not implemented (just the critical ones), and ended up giving up on it, since her samples were really big and the KS test is known to be best suited to continuous distributions.
So, she started using the Bhattacharyya distance and the Kullback–Leibler divergence and suggested we implement them here.
She has been kind enough to provide the basis for the code, which I reproduce here so I can find it more easily when I'm implementing them:

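import numpy as np

def bhattacharyya_coefficient(distribution_1, distribution_2):
    # Sum of the square roots of the element-wise products
    return np.sum(np.sqrt(distribution_1 * distribution_2))

def bhattacharyya_distance(distribution_1, distribution_2):
    # Negative log of the coefficient; 0 when the distributions are identical
    return -np.log(bhattacharyya_coefficient(distribution_1, distribution_2))

def kullbackLeibler_divergence(distribution_1, distribution_2):
    # np.where still evaluates the log for every element, so zeros can emit warnings
    return np.sum(np.where(distribution_1 != 0, distribution_1 * np.log(distribution_1 / distribution_2), 0))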
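For comparison, here is a dependency-free sketch of the same two measures that never takes the logarithm of zero, and so cannot emit a divide-by-zero warning. The function names and the choice to return infinity for non-overlapping distributions are assumptions for illustration, not the implementation that ended up in benford_py.

```python
import math

def bhattacharyya_distance_safe(p, q):
    """Bhattacharyya distance between two discrete distributions."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    if coefficient == 0:
        return math.inf  # no overlap at all
    return -math.log(coefficient)

def kl_divergence_safe(p, q):
    """Kullback-Leibler divergence D(p || q), in nats."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue         # 0 * log(0 / b) is taken as 0
        if b == 0:
            return math.inf  # p has mass where q has none
        total += a * math.log(a / b)
    return total

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
uniform = [1 / 9] * 9

print(bhattacharyya_distance_safe(benford, uniform))
print(kl_divergence_safe(benford, uniform))
```

Both measures are zero for identical distributions and grow as the found proportions move away from the expected ones. Neither depends on the sample size.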

Option to save figure when plotting

As suggested by Данила Пилин.
When calling the functions or methods that have show_plot as argument, there should also be an option to save the plot to a file.

The code makes negated belonging comparisons (if not K in list:)

The code performs negated membership tests (if not k in list). Python parses these as not (k in list), so they work, but they are harder to read; PEP 8 recommends the not in operator.
Use if K not in list: instead.
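The two spellings give the same result, as this small check shows. The list contents are arbitrary test names used only for the example.

```python
tests = ["F1D", "F2D", "F3D"]
name = "L2D"

negated_in = not name in tests  # noqa: E713 -- parsed as not (name in tests)
not_in = name not in tests      # preferred spelling

print(negated_in, not_in)
```

Since the behaviour is identical, the change is purely a readability improvement and carries no risk of altering results.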

Disable pandas unnecessary warning in release versions

warnings.filterwarnings("ignore", category=DeprecationWarning)

Reports

Demo Notebook improvements

Source header missed during relicense

Hi Marcel,

When relicensing, I believe you missed changing this file:

https://github.com/milcent/benford_py/blob/master/benford/__init__.py

It still has a GPL-3.0 license header.

Tests, tests..., and tests (I meant PY-tests)

Benford class image summary

A graphical representation of the main features, with a visual explanation of the tests and the most relevant parameters

last-digit test, L1D/LDT?

A test using only the last digit may be useful if only a few digits are present and the second-to-last digit may be skewed.

I only saw a last-two digits test.
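A last-digit test could start from something like the sketch below, which tabulates the proportion of each last digit in integer data for comparison against a uniform expectation of 1/10 per digit. The helper `last_digit_proportions` is hypothetical and is not a function of benford_py.

```python
from collections import Counter

def last_digit_proportions(values):
    """Proportion of each last digit (0..9) in a series of integers."""
    digits = [abs(int(v)) % 10 for v in values]
    if not digits:
        raise ValueError("no values to test")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(10)}

found = last_digit_proportions(range(100, 200))
expected = 1 / 10
deviations = {d: abs(p - expected) for d, p in found.items()}
```

The deviations can then be fed to the same statistics used by the other digit tests, such as a Z score per digit.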

Files on PyPI don't match the project name

https://pypi.org/project/benford-py/#files has benford_py-0.3.2.tar.gz while the project is named benford-py.

IMO if you renamed the project, this should affect file names, GH project name, etc, to reduce confusion.

Warning depending on sample size

Fix KS critical value for 90% confidence

Raised by Octavian Ceban

Kolmogorov-Smirnov test in Mantissas distribution

Although implemented in the Test class, the Kolmogorov-Smirnov statistic is known to work for continuous distributions, not discrete ones, such as the ones from the digits tests (1, 2, 3…9, and so on).
The Mantissas test, however, is made on a continuous distribution, since the ordered mantissas of the log10 of a Benford set of numbers are expected to form a straight line from 0 to almost 1. So far, the mantissas test is more of a qualitative one, from eyeballing the plot. We should implement and test KS for the mantissas, and check its prospects.
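As a starting point for that discussion, the KS statistic of a set of mantissas against the uniform distribution on [0, 1) can be computed in a few lines. This is a sketch using the textbook one-sample definition; the function name is hypothetical and nothing here is taken from the library's Test class.

```python
def ks_statistic_uniform(mantissas):
    """One-sample KS statistic against the uniform CDF F(x) = x on [0, 1)."""
    xs = sorted(mantissas)
    n = len(xs)
    if n == 0:
        raise ValueError("no mantissas to test")
    # Largest gap above and below the empirical CDF steps.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

# Evenly spread mantissas give a very small statistic...
even = [(i + 0.5) / 100 for i in range(100)]
d_even = ks_statistic_uniform(even)

# ...while mantissas piled up at one point give a large one.
clumped = [0.0] * 10
d_clumped = ks_statistic_uniform(clumped)
```

The statistic would then be compared with a critical value for the chosen confidence level and sample size to turn the visual check into a quantitative one.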