carloscinelli / benford.analysis

Tools that make it easier to use Benford’s law for data validation and forensic analytics.

benford.analysis

The Benford Analysis (benford.analysis) package provides tools that make it easier to validate data using Benford’s Law. The main purpose of the package is to identify suspicious data that need further verification.

CRAN

You can install the package from CRAN by running:

install.packages("benford.analysis")

How to install the development version from GitHub

To install the GitHub version you need to have the package devtools installed. Make sure to set the option build_vignettes = TRUE to compile the package vignette.

# install.packages("devtools") # run this to install the devtools package
devtools::install_github("carloscinelli/benford.analysis", build_vignettes = TRUE)

Example usage

The benford.analysis package comes with 6 real datasets from Mark Nigrini’s book Benford’s Law: Applications for Forensic Accounting, Auditing, and Fraud Detection.

Here we will give an example using 189,470 records from the corporate payments data. First we need to load the package and the data:

library(benford.analysis) # loads package
data(corporate.payment) # loads data

Then, to validate the data against Benford’s law, simply apply the function benford to the appropriate column:

bfd.cp <- benford(corporate.payment$Amount)

The command above creates an object of class “Benford” with the results of the analysis using the first two significant digits. You can choose a different number of digits by changing the number.of.digits parameter. For more information and parameters, see ?benford.
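
For reference, the expected frequencies that the data are compared against follow directly from Benford’s law: a leading digit group d has probability log10(1 + 1/d). The sketch below computes them in base R; benford_expected is a hypothetical helper written for illustration and is not part of the package.

```r
# Expected Benford probabilities for the first k significant digits.
# `benford_expected` is a hypothetical helper, not part of benford.analysis.
benford_expected <- function(k = 2) {
  digits <- (10^(k - 1)):(10^k - 1)   # 10..99 when k = 2
  data.frame(digits = digits, prob = log10(1 + 1 / digits))
}

expected <- benford_expected(2)
head(expected, 3)      # probabilities for 10, 11 and 12
sum(expected$prob)     # the probabilities sum to 1
```

With k = 1 the same helper returns the familiar first-digit probabilities, starting at log10(2) for the digit 1.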

Let’s check the main plots of the analysis:

plot(bfd.cp)

The original data are in blue and the expected frequencies according to Benford’s law are in red. For instance, in our example, the first plot shows that the data do tend to follow Benford’s law, but also that there is a clear discrepancy at 50.

You can print the main results of the analysis:

bfd.cp
#> 
#> Benford object:
#>  
#> Data: corporate.payment$Amount 
#> Number of observations used = 185083 
#> Number of obs. for second order = 65504 
#> First digits analysed = 2
#> 
#> Mantissa: 
#> 
#>    Statistic  Value
#>         Mean  0.496
#>          Var  0.092
#>  Ex.Kurtosis -1.257
#>     Skewness -0.002
#> 
#> 
#> The 5 largest deviations: 
#> 
#>   digits absolute.diff
#> 1     50       5938.25
#> 2     11       3331.98
#> 3     10       2811.92
#> 4     14       1043.68
#> 5     98        889.95
#> 
#> Stats:
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  corporate.payment$Amount
#> X-squared = 32094, df = 89, p-value < 2.2e-16
#> 
#> 
#>  Mantissa Arc Test
#> 
#> data:  corporate.payment$Amount
#> L2 = 0.0039958, df = 2, p-value < 2.2e-16
#> 
#> 
#>  Kolmogorov-Smirnov test
#> 
#> data:  corporate.payment$Amount
#> D = 0.033195, critical value = 0.0031612
#> 
#> Mean Absolute Deviation (MAD): 0.002336614
#> MAD Conformity - Nigrini (2012): Nonconformity
#> Distortion Factor: -1.065467
#> 
#> Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!

The print method first shows the general information of the analysis, like the name of the data used, the number of observations used and how many significant digits were analyzed.

After that you have the main statistics of the log mantissa of the data. If the data follows Benford’s Law, the numbers should be close to:

Statistic	Value
Mean	0.5
Variance	1/12 (0.08333…)
Ex. Kurtosis	-1.2
Skewness	0
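
These reference values are the moments of a uniform distribution on [0, 1), which is what the mantissa follows under Benford’s law. A minimal check in base R, using a fine deterministic grid as a stand-in for such a sample (the grid and the variable names are assumptions made for illustration):

```r
# Under Benford's law the mantissa of log10(x) is uniform on [0, 1).
# A fine deterministic grid stands in for such a sample.
n <- 100000
mantissa <- (seq_len(n) - 0.5) / n

m  <- mean(mantissa)
v  <- mean((mantissa - m)^2)             # population variance
sk <- mean((mantissa - m)^3) / v^1.5     # skewness
ek <- mean((mantissa - m)^4) / v^2 - 3   # excess kurtosis

round(c(mean = m, var = v, ex.kurtosis = ek, skewness = sk), 4)
```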

Printing also shows the 5 largest discrepancies. Notice that, as we saw in the plot, the largest deviation is at 50. These deviations are good candidates for closer inspection. The output also shows the results of statistical tests such as the Chi-squared test and the Mantissa Arc test.
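
To make the Chi-squared figures less opaque, here is a sketch of how such a statistic can be computed by hand for the first two digits. It uses synthetic data (10^u on a uniform grid), not the corporate payments data, and it is not the package’s internal code.

```r
# Deterministic stand-in for Benford-distributed data: 10^u, u on a uniform grid.
n <- 20000
u <- (seq_len(n) - 0.5) / n
first_two <- floor(10 * 10^u)               # first two digits, values in 10..99

digits   <- 10:99
expected <- log10(1 + 1 / digits)           # Benford probabilities
observed <- tabulate(first_two - 9, nbins = 90)

chi_sq <- sum((observed - n * expected)^2 / (n * expected))
df     <- length(digits) - 1                # 89, as in the output above
p_val  <- pchisq(chi_sq, df, lower.tail = FALSE)
c(chi_sq = chi_sq, df = df, p.value = p_val)
```

Because the statistic grows with the number of observations, very large samples such as the one above tend to produce tiny p-values even for modest deviations, which is why the output warns against focusing on p-values.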

The package provides some helper functions to further investigate the data. For example, you can easily extract the observations with the largest discrepancies by using the getSuspects function.

suspects <- getSuspects(bfd.cp, corporate.payment)
suspects
#> Warning in format.POSIXlt(as.POSIXlt(x), ...): unknown timezone 'zone/tz/
#> 2019b.1.0/zoneinfo/America/Los_Angeles'
#>        VendorNum       Date       InvNum  Amount
#>     1:      2001 2010-01-02      3822J10   50.38
#>     2:      2001 2010-01-07     100107-2 1166.29
#>     3:      2001 2010-01-08  11210084007 1171.45
#>     4:      2001 2010-01-08      1585J10   50.42
#>     5:      2001 2010-01-08      4733J10  113.34
#>    ---                                          
#> 17852:     52867 2010-07-01 270358343233   11.58
#> 17853:     52870 2010-02-01 270682253025   11.20
#> 17854:     52904 2010-06-01 271866383919   50.15
#> 17855:     52911 2010-02-01 270957401515   11.20
#> 17856:     52934 2010-02-01 271745237617   11.88

More information can be found on the help documentation and examples. The vignette will be ready soon.

benford.analysis's Issues

Include last two digits plots (histogram and rootogram)

with error bounds
relative and absolute frequency

Allow user to choose the significance level of the error bars

Include legend only when plots that need legend are plotted

For instance, if one plots only a chi squared plot, no legend is needed.

except = "none" should be equal select = "all"

Meaning of number.of.digits in case of non-discrete numbers

Hi Carlos,

thanks a lot for this package, it's really useful! I am analysing interest rate data and am not sure what exactly this option means. The documentation states that "number.of.digits" defines how many first digits to analyse.
Does it take the decimal separator of a numeric value into account, or does it just take the number as it is, like 035 for an interest rate of 0.35%?

Thanks a lot for some clarification
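
As background for this question: Benford analysis is conventionally defined on significant digits, so leading zeros and the position of the decimal point do not matter. The sketch below illustrates that convention with a hypothetical helper, first_digits, which reads the digits from the scientific-notation representation. It is not the package’s extract.digits function, whose actual behaviour should be checked in its documentation.

```r
# Hypothetical helper (not the package's extract.digits): take the first k
# significant digits from the scientific-notation representation, so the
# position of the decimal point does not matter.
first_digits <- function(x, k = 2) {
  x <- abs(x[x != 0 & is.finite(x)])            # zeros have no leading digit
  sci <- sprintf("%.15e", x)                    # e.g. "3.500000000000000e-01"
  as.integer(substr(sub(".", "", sci, fixed = TRUE), 1, k))
}

first_digits(c(0.35, 3.5, 35, 0.0035), k = 2)   # all four share the digits 35
```

Under this convention an interest rate of 0.35% contributes the digits 35, not 03.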

KS test significance level and p-values

@rafaelslins the current critical value for the KS-test seems to be hard coded for the significance level of 5%.

For now, this should be informed in the output of the test, since the significance level is hard coded.

But as an improvement, we should either provide other significance levels as an option, or, instead, provide the p-value of the KS-test (we can use a ks critical value table to get the approximate p-value).

Another alternative is using the base R ks.test function, but we need to double check whether it works well.
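
A sketch of the first two options, assuming the usual large-sample approximations for the Kolmogorov-Smirnov statistic. The function names ks_critical and ks_pvalue are hypothetical, and the formulas are derived for continuous distributions, so for digit data they are approximations only.

```r
# Large-sample KS critical value: c(alpha) / sqrt(n), with
# c(alpha) = sqrt(-0.5 * log(alpha / 2)), which is about 1.36 at alpha = 0.05.
ks_critical <- function(n, alpha = 0.05) {
  sqrt(-0.5 * log(alpha / 2)) / sqrt(n)
}

# Asymptotic p-value from the Kolmogorov distribution (continuous case).
ks_pvalue <- function(D, n, terms = 100) {
  lambda <- D * sqrt(n)
  k <- seq_len(terms)
  p <- 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * lambda^2))
  min(max(p, 0), 1)
}

ks_critical(185083)            # close to the 0.0031612 printed in the README
ks_critical(185083, 0.01)      # stricter level, larger critical value
ks_pvalue(0.033195, 185083)    # effectively zero for the README example
```

The README value 0.0031612 corresponds to 1.36 / sqrt(185083), which is consistent with a hard-coded 5% level.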

usage of 'select' and 'except' arguments

data(corporate.payment)
cp <- corporate.payment$Amount
plot(cp, select = "digits", except = "digits")
plot(cp, select = "none", except = "all")

Planning paper on Benford Analysis using your package

Carlos, are you still maintaining this package? I saw that there was an update in 2017 but not much activity otherwise.
I recently was asked to perform a Benford Analysis in my professional work on financial data and your package was very helpful. I am going to write a general paper that might be turned into a vignette and would like to reach out to you if I have specific questions.
Thanks for building this.
Kier

Include new mantissa arc plot

Adjustment in plot functions

@rafaelslins

the argument grid=F is not working for the summation graph
remove rootogram from the default plots; 8 plots as default is too many.
Change title: "Barchart of digits" --> "Digits distribution\nBarchart". Where "\n" is a line break.
Change title: "Rootogram of digits" --> "Digits distribution\nRootogram".
Do the same for second order, but now "Digits distribution\nSecond Order Test - Barchart", "Digits distribution\nSecond Order Test - Rootogram".
Can we remove the extra white-space in the Chi-squared and summation difference plots?

Time out in Power BI service

Hi! First of all, thank you for this great work. I am using it on Power BI Desktop and it works great. However, for some reason, it's timing out in the Power BI Service (cloud). The error message is:

Execution Timeout
The script execution timed out, please try again later

Could you give me some guidance for this issue?
Thank you!

Readme file

It was recently pointed out to us that some README.html files (generated
from the corresponding README.md ones) on the CRAN package web pages are
incomplete, missing 'local' images not available from the web page and
in most cases actually not even shipped with the package. This clearly
should be changed, so we will move to using '--self-contained' for the
pandoc conversion to ensure that the README.html files are "complete".

Of course, this implies that all 'local' images used in README.md are
needed in the package sources.

If the images are also used for vignettes or Rd files, you can put them
in the 'vignettes' or 'man/figures' directories. Otherwise, please put
them in the top-level 'tools' directory, or a subdirectory of it.

The CRAN incoming checks in r-devel were changed to perform the pandoc
conversion checks with '--self-contained', and hence will warn about
missing images.

Pls ensure completeness in the next regular update of your package.

For your information, I attach a list of images used in README.md but
missing from the package sources.

Best
-k

Extra plot in function plot.Benford

Although I exclude all other plots, the function plot.Benford still gives me two plots. Here is a reprex:

# Load package benford.analysis
library(benford.analysis)
data(census.2009)

# Check conformity
bfd.cen <- benford(census.2009$pop.2009, number.of.digits = 1) 
plot(bfd.cen, except = c("second order", "summation", "mantissa", "chi squared","abs diff", "ex summation", "Legend"), multiple = F)

I think it would be better to avoid the second plot.

Here is my session information.

Created on 2018-11-24 by the reprex package (v0.2.1)

Session info

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS  10.14                
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  zh_CN.UTF-8                 
#>  ctype    zh_CN.UTF-8                 
#>  tz       Asia/Shanghai               
#>  date     2018-11-24                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
#>  backports     1.1.2   2017-12-13 [1] CRAN (R 3.5.0)
#>  base64enc     0.1-3   2015-07-28 [1] CRAN (R 3.5.0)
#>  callr         3.0.0   2018-08-24 [1] CRAN (R 3.5.0)
#>  cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.0)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.1)
#>  digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)
#>  evaluate      0.12    2018-10-09 [1] CRAN (R 3.5.0)
#>  fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)
#>  glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)
#>  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
#>  knitr         1.20    2018-02-20 [1] CRAN (R 3.5.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)
#>  pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.2.0   2018-08-16 [1] CRAN (R 3.5.0)
#>  ps            1.2.1   2018-11-06 [1] CRAN (R 3.5.0)
#>  R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.0)
#>  Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
#>  remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)
#>  rlang         0.3.0.1 2018-10-25 [1] CRAN (R 3.5.0)
#>  rmarkdown     1.10    2018-06-11 [1] CRAN (R 3.5.0)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)
#>  stringi       1.2.4   2018-07-20 [1] CRAN (R 3.5.0)
#>  stringr       1.3.1   2018-05-10 [1] CRAN (R 3.5.0)
#>  testthat      2.0.0   2017-12-13 [1] CRAN (R 3.5.0)
#>  usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

Add option to select certain digits and analyze them separately

I would like to ask whether it would be possible to add an option to analyze subsequent digits on their own, not only together with the first digits.
I know that this is not really in the spirit of the original Benford analysis, but it would be interesting to check the distribution of individual digits, and it would probably be fairly easy to implement. I will try to come up with something on my own so that it becomes clearer what I mean by this request.

All the best,
Johannes

adjust graph limits to show error bounds

The error bounds must be computed before setting up the plot limits, so we are sure the bounds are shown in the plot correctly.

Bug for extract.digits

When the figure is exactly 1e11 (i.e. 100,000,000,000), the extract.digits() output is 0 instead of 1. There might be an issue with the trunc part. Thanks

new plot to analyze the difference (first digits) vs excess summation

Exact error bound using the binomial distribution
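
One possible shape for such bounds, as a sketch only: treat the count of each digit group as binomial with the Benford probability and take quantiles with qbinom. The helper name binomial_bounds and its arguments are assumptions made for illustration, not an existing package function.

```r
# Hypothetical helper: exact binomial bounds for the count of each leading digit.
binomial_bounds <- function(n, k = 1, conf.level = 0.95) {
  digits <- (10^(k - 1)):(10^k - 1)
  p      <- log10(1 + 1 / digits)          # Benford probabilities
  alpha  <- 1 - conf.level
  data.frame(
    digits   = digits,
    expected = n * p,
    lower    = qbinom(alpha / 2, n, p),
    upper    = qbinom(1 - alpha / 2, n, p)
  )
}

binomial_bounds(n = 1000, k = 1)
```

Exposing conf.level as an argument would also let the user choose the significance level of the error bars.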

Write vignette

Write package vignette.

fix notes from r-devel checks

no visible global function definition
undefined global functions or variables

Include option of relative frequency in histogram and rootogram

better print testing

Testing for numeric data

Hello Carlos,

When you are testing whether or not the data supplied to extract.digits() is numeric data, I think you should be using is.numeric(). As it stands now, the use of class(data) != "numeric" produces unexpected errors when the class is "integer". Consider the following:

library(benford.analysis)

dat <- data.frame(v1 = 1:5, v2 = c(1, 2, 3, 4, 5))

benford(dat$v1)          # produces error

This error is returned because while dat$v1 is clearly numeric, the call class(dat$v1) will return "integer".

On a different note, thank you for a wonderful R package!

-Paul
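
The difference between the two checks can be seen in base R alone. The check_numeric function below is a hypothetical input check written along the lines suggested in this issue, not the package’s code.

```r
x_int <- 1:5                 # integer vector
x_dbl <- c(1, 2, 3, 4, 5)    # double vector

class(x_int)                 # "integer"
class(x_dbl)                 # "numeric"
class(x_int) != "numeric"    # TRUE, so a class()-based check rejects integers
is.numeric(x_int)            # TRUE
is.numeric(x_dbl)            # TRUE

# Hypothetical input check along the lines suggested above
check_numeric <- function(data) {
  if (!is.numeric(data)) stop("Data must be a numeric vector")
  invisible(TRUE)
}
check_numeric(x_int)         # passes without error
```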

Warning messages in simple example

This example creates warning messages:

test <- benford(1000)

Verify what is going on.

"freq" parameter of benford function

Due to recent updates, the "freq" parameter is not working.

create readme

include package stats on readme
include brief example
instructions on how to install on github and CRAN

Include "second digit" only plots

Maybe we should also allow the analysis of the second digit only. (or an arbitrary digit only).
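
For a second-digit-only analysis, the expected frequencies come from summing the two-digit Benford probabilities over all possible first digits. A base R sketch (the variable name is an assumption made for illustration):

```r
# Benford probability that the second significant digit equals d2:
# sum over first digits d1 = 1..9 of log10(1 + 1 / (10 * d1 + d2)).
second_digit_prob <- sapply(0:9, function(d2) {
  sum(log10(1 + 1 / (10 * (1:9) + d2)))
})
names(second_digit_prob) <- 0:9

round(second_digit_prob, 4)
sum(second_digit_prob)         # sums to 1
```

The second-digit distribution is much flatter than the first-digit one, so deviations are harder to detect with small samples.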

Missing labels of the summary statistics of the mantissa

In print.Benford the labels for the summary stats of the mantissa are missing. They were accidentally excluded.

attribute difference in unit tests equality check

Current devel data.table gets its own precise method for all.equal, which testthat seems to use behind the scenes.
It appears that the unit test named "Negative numbers, simulated log-normal *(-1)" defines its expected results as a test data.table which has a named vector inside as a column. This does not match the output of your mantissa function.

Expected vs Observed 45 degree line plot

Additional function to use MAD to assess conformity to Benford's Law

Hi Carlos,

I stumbled across the problem that the Chi-square test suffers from excess power when there are a lot of observations, so I've implemented a rather simple function for the check based on the MAD, as proposed by Nigrini (2012, Chapter 7, page 160).
I basically just check the MAD value based on the digits analyzed and then print out the decision based on Nigrini's intervals. I don't know if that's a valuable contribution, but it's at least an additional point to add, since the Chi-square test often leads to wrong conclusions.

Cheers,
Johannes
201803_MAD_Conformity.txt
Nigrini, M.J. (2012). Benford's law. Applications for forensic accounting, auditing, and fraud detection (Hoboken, NJ: Wiley).
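
A minimal sketch of what such a check could look like. This is not the attached file and not the package’s implementation. The helper name mad_conformity is hypothetical, and the cut-offs are the values commonly attributed to Nigrini (2012) for first-digit and first-two-digit tests, so treat them as assumptions and verify them against the book.

```r
# Hypothetical helper, not the package's implementation.
mad_conformity <- function(observed_counts, k = 2) {
  stopifnot(k %in% c(1, 2))
  digits <- (10^(k - 1)):(10^k - 1)
  stopifnot(length(observed_counts) == length(digits))

  expected <- log10(1 + 1 / digits)               # Benford proportions
  actual   <- observed_counts / sum(observed_counts)
  mad      <- mean(abs(actual - expected))        # mean absolute deviation

  # Cut-offs attributed to Nigrini (2012); assumptions, verify against the book.
  cuts   <- if (k == 1) c(0.006, 0.012, 0.015) else c(0.0012, 0.0018, 0.0022)
  labels <- c("Close conformity", "Acceptable conformity",
              "Marginally acceptable conformity", "Nonconformity")
  list(MAD = mad, conformity = labels[findInterval(mad, cuts) + 1])
}

# Counts that match Benford's law almost exactly
mad_conformity(round(1e6 * log10(1 + 1 / (10:99))), k = 2)
```

Unlike the Chi-squared statistic, the MAD does not grow with the number of observations, which is why it is less sensitive to very large samples.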

Improve default plot order and legend placement

It might not be clear to users what the chi-squared difference refers to. Maybe put both next to each other, and improve the description of the plot.

Also, think about a better default legend placement.