benford.analysis's Introduction

benford.analysis

The Benford Analysis (benford.analysis) package provides tools that make it easier to validate data using Benford’s Law. The main purpose of the package is to identify suspicious data that need further verification.

CRAN

You can install the package from CRAN by running:

install.packages("benford.analysis")

How to install the development version from GitHub

To install the GitHub version you need to have the package devtools installed. Make sure to set the option build_vignettes = TRUE to compile the package vignette.

# install.packages("devtools") # run this to install the devtools package
devtools::install_github("carloscinelli/benford.analysis", build_vignettes = TRUE)

Example usage

The benford.analysis package comes with 6 real datasets from Mark Nigrini’s book Benford’s Law: Applications for Forensic Accounting, Auditing, and Fraud Detection.

Here we will give an example using 189,470 records from the corporate payments data. First, we need to load the package and the data:

library(benford.analysis) # loads package
data(corporate.payment) # loads data

Then, to validate the data against Benford’s law, simply apply the benford function to the appropriate column:

bfd.cp <- benford(corporate.payment$Amount)

The command above creates an object of class “Benford” with the results of the analysis using the first two significant digits. You can choose a different number of digits by changing the number.of.digits parameter. For more information and parameters, see ?benford.

Let’s check the main plots of the analysis:

plot(bfd.cp)

The original data is in blue and the expected frequency according to Benford’s law is in red. For instance, in our example, the first plot shows that the data do have a tendency to follow Benford’s law, but also that there is a clear discrepancy at 50.
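
To make the red reference values concrete: under Benford's law, the expected proportion of a leading digit group d is log10(1 + 1/d). The following is a minimal base R sketch of that formula; the helper name benford_expected is ours, for illustration only, and is not a function of the package.

```r
# Expected Benford proportions for the first `number.of.digits` digits.
# benford_expected() is a hypothetical helper, not part of benford.analysis.
benford_expected <- function(number.of.digits = 2) {
  first <- 10^(number.of.digits - 1)   # e.g. 10 for two digits
  last  <- 10^number.of.digits - 1     # e.g. 99 for two digits
  digits <- first:last
  data.frame(digits = digits, expected = log10(1 + 1 / digits))
}

expected <- benford_expected(2)
sum(expected$expected)                      # proportions sum to 1
expected$expected[expected$digits == 50]    # log10(51/50), roughly 0.0086
```

With these proportions, a spike like the one at 50 corresponds to an observed share well above roughly 0.86% of the records.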

You can print the main results of the analysis:

bfd.cp
#> 
#> Benford object:
#>  
#> Data: corporate.payment$Amount 
#> Number of observations used = 185083 
#> Number of obs. for second order = 65504 
#> First digits analysed = 2
#> 
#> Mantissa: 
#> 
#>    Statistic  Value
#>         Mean  0.496
#>          Var  0.092
#>  Ex.Kurtosis -1.257
#>     Skewness -0.002
#> 
#> 
#> The 5 largest deviations: 
#> 
#>   digits absolute.diff
#> 1     50       5938.25
#> 2     11       3331.98
#> 3     10       2811.92
#> 4     14       1043.68
#> 5     98        889.95
#> 
#> Stats:
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  corporate.payment$Amount
#> X-squared = 32094, df = 89, p-value < 2.2e-16
#> 
#> 
#>  Mantissa Arc Test
#> 
#> data:  corporate.payment$Amount
#> L2 = 0.0039958, df = 2, p-value < 2.2e-16
#> 
#> 
#>  Kolmogorov-Smirnov test
#> 
#> data:  corporate.payment$Amount
#> D = 0.033195, critical value = 0.0031612
#> 
#> Mean Absolute Deviation (MAD): 0.002336614
#> MAD Conformity - Nigrini (2012): Nonconformity
#> Distortion Factor: -1.065467
#> 
#> Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!

The print method first shows the general information of the analysis, like the name of the data used, the number of observations used and how many significant digits were analyzed.

After that you have the main statistics of the log mantissa of the data. If the data follows Benford’s Law, the numbers should be close to:

Statistic      Value
Mean           0.5
Variance       1/12 (0.08333…)
Ex. Kurtosis   -1.2
Skewness       0
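
These reference values are the moments of a uniform distribution on [0, 1), which is what the log mantissa follows when the data conform to Benford's law. The sketch below checks this on simulated data in base R. The helper log_mantissa and the simulated sample are assumptions made for illustration; they are not the package's own mantissa code.

```r
# Log mantissa: fractional part of log10(|x|), ignoring zeros.
# log_mantissa() is a hypothetical helper written for this example.
log_mantissa <- function(x) {
  l <- log10(abs(x[x != 0]))
  l - floor(l)
}

set.seed(1)
x <- 10^runif(1e5, min = 0, max = 5)   # simulated data that follows Benford's law
m <- log_mantissa(x)

z <- m - mean(m)                        # centered mantissas
stats <- c(
  Mean        = mean(m),
  Var         = var(m),
  Ex.Kurtosis = mean(z^4) / mean(z^2)^2 - 3,
  Skewness    = mean(z^3) / mean(z^2)^1.5
)
round(stats, 3)
```

On a sample of this size the four statistics should land close to 0.5, 1/12, -1.2 and 0, up to sampling noise.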

Printing also shows the 5 largest discrepancies. Notice that, as we saw in the plot, the largest deviation is at the digits 50. These deviations are good candidates for closer inspection. The output also includes the results of statistical tests such as the Chi-squared test and the Mantissa Arc test.

The package provides some helper functions to further investigate the data. For example, you can easily extract the observations with the largest discrepancies by using the getSuspects function.

suspects <- getSuspects(bfd.cp, corporate.payment)
suspects
#> Warning in format.POSIXlt(as.POSIXlt(x), ...): unknown timezone 'zone/tz/
#> 2019b.1.0/zoneinfo/America/Los_Angeles'
#>        VendorNum       Date       InvNum  Amount
#>     1:      2001 2010-01-02      3822J10   50.38
#>     2:      2001 2010-01-07     100107-2 1166.29
#>     3:      2001 2010-01-08  11210084007 1171.45
#>     4:      2001 2010-01-08      1585J10   50.42
#>     5:      2001 2010-01-08      4733J10  113.34
#>    ---                                          
#> 17852:     52867 2010-07-01 270358343233   11.58
#> 17853:     52870 2010-02-01 270682253025   11.20
#> 17854:     52904 2010-06-01 271866383919   50.15
#> 17855:     52911 2010-02-01 270957401515   11.20
#> 17856:     52934 2010-02-01 271745237617   11.88
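
The rows returned above all have amounts starting with 50 or 11, the two largest deviations in the printed results. Conceptually, this kind of selection amounts to filtering rows by their leading digits. The sketch below reproduces the idea by hand in base R on a made-up table; payments, first_two and suspect_digits are hypothetical names, and this is not how getSuspects is implemented.

```r
# Toy payments table; the values are made up for illustration.
payments <- data.frame(
  InvNum = c("A1", "A2", "A3", "A4", "A5"),
  Amount = c(50.38, 1166.29, 75.00, 113.34, 980.10),
  stringsAsFactors = FALSE
)

# First two significant digits via arithmetic. Exact powers of ten and other
# floating-point edge cases may need extra care in real code.
first_two <- function(x) {
  x <- abs(x)
  floor(x / 10^(floor(log10(x)) - 1))
}

suspect_digits <- c(50, 11)   # digit groups with the largest deviations
suspects_by_hand <- payments[first_two(payments$Amount) %in% suspect_digits, ]
suspects_by_hand              # keeps invoices A1, A2 and A4
```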

More information can be found in the help documentation and examples. The vignette will be ready soon.

benford.analysis's People

Contributors

carloscinelli, jangorecki, rafaelslins

benford.analysis's Issues

Meaning of number.of.digits in case of non-discrete numbers

Hi Carlos,

Thanks a lot for this package, it's really useful! I am asking about the meaning of this option because I am analysing interest rate data, and the documentation only states that "number.of.digits" defines how many first digits to analyse.
Does it take the decimal separator of a numeric value into account, or does it just take the number as it is, like 035 for an interest rate of 0.35%?

Thanks a lot for some clarification

KS test significance level and p-values

@rafaelslins the current critical value for the KS-test seems to be hard coded for the significance level of 5%.

For now, this should be informed in the output of the test, since the significance level is hard coded.

But as an improvement, we should either provide other significance levels as an option, or, instead, provide the p-value of the KS-test (we can use a ks critical value table to get the approximate p-value).

Another alternative is using the base R ks.test function, but we need to double check whether it works well.
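
As a rough illustration of the first option, the usual large-sample approximation for the one-sample KS critical value is c(alpha) / sqrt(n), with c(alpha) = sqrt(-0.5 * log(alpha / 2)). At the 5% level c is about 1.358, commonly rounded to 1.36, and the critical value printed in the README example (0.0031612 for n = 185083) is consistent with 1.36 / sqrt(n). The helper ks_critical below is a hypothetical name, and the approximation assumes a continuous reference distribution, so treat it as a sketch rather than a drop-in replacement.

```r
# Asymptotic one-sample KS critical value: c(alpha) / sqrt(n),
# with c(alpha) = sqrt(-0.5 * log(alpha / 2)).
# ks_critical() is a hypothetical helper, not part of the package.
ks_critical <- function(n, alpha = 0.05) {
  sqrt(-0.5 * log(alpha / 2)) / sqrt(n)
}

n <- 185083                     # observations used in the README example
ks_critical(n, alpha = 0.05)    # close to the 0.0031612 shown in the README
ks_critical(n, alpha = 0.01)    # a stricter level gives a larger critical value
```

Exposing alpha as an argument like this would also let the output state which significance level the critical value refers to.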

Planning paper on Benford Analysis using your package

Carlos, are you still maintaining this package? I saw that there was an update in 2017 but not much activity otherwise.
I recently was asked to perform a Benford Analysis in my professional work on financial data and your package was very helpful. I am going to write a general paper that might be turned into a vignette and would like to reach out to you if I have specific questions.
Thanks for building this.
Kier

Adjustment in plot functions

@rafaelslins

  • the argument grid=F is not working for the summation graph

  • remove rootogram from the default plots, 8 plots as default is too much.

  • Change title: "Barchart of digits" --> "Digits distribution\nBarchart". Where "\n" is a line break.

  • Change title: "Rootogram of digits" --> "Digits distribution\nRootogram".

  • Do the same for second order but now "Digits distribution\nSecond Order Test - Barchart", "Digits distribution\nSecond Order Test - Rootogram".

  • Can we remove the extra white-space in the Chi-squared and summation difference plots?

Time out in Power BI Service

Hi! First of all, thank you for this great work. I am using it in Power BI Desktop and it works great. However, for some reason, it times out in the Power BI Service (cloud). The error message is:

Execution Timeout
The script execution timed out, please try again later

Could you give me some guidance for this issue?
Thank you!

Readme file

It was recently pointed out to us that some README.html files (generated
from the corresponding README.md ones) on the CRAN package web pages are
incomplete, missing 'local' images not available from the web page and
in most cases actually not even shipped with the package. This clearly
should be changed, so we will move to using '--self-contained' for the
pandoc conversion to ensure that the README.html files are "complete".

Of course, this implies that all 'local' images used in README.md are
needed in the package sources.

If the images are also used for vignettes or Rd files, you can put them
in the 'vignettes' or 'man/figures' directories. Otherwise, please put
them in the top-level 'tools' directory, or a subdirectory of it.

The CRAN incoming checks in r-devel were changed to perform the pandoc
conversion checks with '--self-contained', and hence will warn about
missing images.

Pls ensure completeness in the next regular update of your package.

For your information, I attach a list of images used in README.md but
missing from the package sources.

Best
-k

Extra plot in function plot.Benford

Even when I exclude all the other plots, the function plot.Benford still gives me two plots. Here is a reprex:

# Load package benford.analysis
library(benford.analysis)
data(census.2009)

# Check conformity
bfd.cen <- benford(census.2009$pop.2009, number.of.digits = 1) 
plot(bfd.cen, except = c("second order", "summation", "mantissa", "chi squared","abs diff", "ex summation", "Legend"), multiple = F) 

[Two plot images]

I think it would be better to avoid the second plot.

Here is my session information.

"click the button"
#> [1] "click the button"

Created on 2018-11-24 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS  10.14                
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  zh_CN.UTF-8                 
#>  ctype    zh_CN.UTF-8                 
#>  tz       Asia/Shanghai               
#>  date     2018-11-24                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
#>  backports     1.1.2   2017-12-13 [1] CRAN (R 3.5.0)
#>  base64enc     0.1-3   2015-07-28 [1] CRAN (R 3.5.0)
#>  callr         3.0.0   2018-08-24 [1] CRAN (R 3.5.0)
#>  cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.0)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.1)
#>  digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)
#>  evaluate      0.12    2018-10-09 [1] CRAN (R 3.5.0)
#>  fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)
#>  glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)
#>  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
#>  knitr         1.20    2018-02-20 [1] CRAN (R 3.5.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)
#>  pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.2.0   2018-08-16 [1] CRAN (R 3.5.0)
#>  ps            1.2.1   2018-11-06 [1] CRAN (R 3.5.0)
#>  R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.0)
#>  Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
#>  remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)
#>  rlang         0.3.0.1 2018-10-25 [1] CRAN (R 3.5.0)
#>  rmarkdown     1.10    2018-06-11 [1] CRAN (R 3.5.0)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)
#>  stringi       1.2.4   2018-07-20 [1] CRAN (R 3.5.0)
#>  stringr       1.3.1   2018-05-10 [1] CRAN (R 3.5.0)
#>  testthat      2.0.0   2017-12-13 [1] CRAN (R 3.5.0)
#>  usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

Add option to select certain digits and analyze them separately

I would like to ask whether it would be possible to add an option to analyze subsequent digits on their own, not only together with the first digits.
I know that this is not really in the spirit of the original Benford analysis, but it would be interesting to check the distribution of individual digits, and it would probably be pretty easy to implement. I will try to come up with something on my own, so that it becomes a bit clearer what I mean by this request.

All the best,
Johannes
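
One possible reading of this request is a second-digit test on its own. Under Benford's law, the expected proportion of a second digit d2 is the sum over first digits d1 = 1..9 of log10(1 + 1/(10*d1 + d2)). The base R sketch below computes these proportions and tabulates observed second digits for a made-up vector; the variable names are ours, and nothing here is an existing option of the package.

```r
# Expected proportion of each second digit (0-9) under Benford's law:
# sum over first digits d1 = 1..9 of log10(1 + 1 / (10 * d1 + d2)).
second_digit_expected <- sapply(0:9, function(d2) {
  sum(log10(1 + 1 / (10 * (1:9) + d2)))
})
names(second_digit_expected) <- 0:9
round(second_digit_expected, 4)

# Observed second digits for a toy vector (values are made up).
x <- c(50.38, 1166.29, 113.34, 11.58, 980.10)
first_two    <- floor(x / 10^(floor(log10(x)) - 1))   # first two significant digits
second_digit <- first_two %% 10                       # keep only the second one
table(factor(second_digit, levels = 0:9))
```

The expected second-digit distribution is much flatter than the first-digit one, so a test on second digits alone needs more observations to detect the same relative deviation.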

Bug for extract.digits

When the figure is exactly 1e11 (i.e. 100,000,000,000), the extract.digits() output is 0 instead of 1. There might be an issue with the trunc part. Thanks
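
One way to sidestep truncation problems at exact powers of ten is to read the digits from a formatted string instead of computing them arithmetically. The sketch below does this in base R; leading_digits is a hypothetical helper, not the package's extract.digits(), and it relies on sprintf rounding to 15 significant digits.

```r
# String-based extraction of the first `n` significant digits (n <= 15).
# leading_digits() is a hypothetical helper, not the package's extract.digits().
leading_digits <- function(x, n = 1) {
  s <- sprintf("%.14e", abs(x))                         # "d.dddddddddddddde+XX"
  digits <- paste0(substr(s, 1, 1), substr(s, 3, 16))   # drop the decimal point
  as.numeric(substr(digits, 1, n))
}

leading_digits(1e11)            # 1
leading_digits(1e11, n = 2)     # 10
leading_digits(50.38, n = 2)    # 50
```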

Testing for numeric data

Hello Carlos,

When you are testing whether or not the data supplied to extract.digits() is numeric data, I think you should be using is.numeric(). As it stands now, the use of class(data) != "numeric" produces unexpected errors when the class is "integer". Consider the following:

library(benford.analysis)

dat <- data.frame(v1 = 1:5, v2 = c(1, 2, 3, 4, 5))

benford(dat$v1)          # produces error

This error is returned because while dat$v1 is clearly numeric, the call class(dat$v1) will return "integer".

On a different note, thank you for a wonderful R package!

-Paul
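
The difference between the two checks can be seen directly in base R. In the sketch below, check_numeric is a hypothetical validation helper along the lines suggested above, not the package's actual code.

```r
v_int <- 1:5                 # integer vector
v_dbl <- c(1, 2, 3, 4, 5)    # double vector

class(v_int)                 # "integer"
class(v_dbl)                 # "numeric"

class(v_int) != "numeric"    # TRUE: a class-string check rejects integers
is.numeric(v_int)            # TRUE: accepts both integer and double
is.numeric(v_dbl)            # TRUE

# Hypothetical validation helper based on is.numeric().
check_numeric <- function(data) {
  if (!is.numeric(data)) stop("Data must be a numeric vector")
  invisible(TRUE)
}
check_numeric(v_int)         # passes silently
```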

create readme

  • include package stats on readme
  • include brief example
  • instructions on how to install on github and CRAN

attribute difference in unit tests equality check

The current development version of data.table gets its own, more precise method for all.equal, which testthat seems to use behind the scenes.
It appears that the unit test named "Negative numbers, simulated log-normal *(-1)" defines its expected results as a test data.table that has a named vector as a column. This does not match the output of your mantissa function.
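
The kind of mismatch described here can be reproduced with plain vectors in base R: equal values with different names attributes are not all.equal unless attributes are ignored or stripped. This is only an illustration of the mechanism, not the package's actual test.

```r
named   <- c(a = 0.1, b = 0.2)   # vector carrying a names attribute
unnamed <- c(0.1, 0.2)           # same values, no names

all.equal(named, unnamed)                                     # reports a names mismatch
isTRUE(all.equal(named, unnamed))                             # FALSE
isTRUE(all.equal(unname(named), unnamed))                     # TRUE
isTRUE(all.equal(named, unnamed, check.attributes = FALSE))   # TRUE
```

Dropping the names from the expected column with unname() is one way to make such a test independent of how strictly attributes are compared.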

Additional function to use MAD to assess conformity to Benford's Law

Hi Carlos,

I stumbled across the problem that the Chi-square test suffers from excess power when there are a lot of observations, so I've written a rather simple function that implements the check based on the MAD, as proposed by Nigrini (2012, Chapter 7, page 160).
I basically just check the MAD value for the digits analyzed and then print out the decision based on Nigrini's intervals. I don't know if that's a valuable contribution, but it is at least an additional check, since the Chi-square test often leads to wrong conclusions.

Cheers,
Johannes
201803_MAD_Conformity.txt
Nigrini, M. J. (2012). Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection. Hoboken, NJ: Wiley.
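
A check of this kind could look roughly like the base R sketch below. The function name mad_conformity is hypothetical, and the cut-offs (0.0012, 0.0018, 0.0022) are the first-two-digits thresholds as commonly quoted from Nigrini (2012); they are an assumption here and should be verified against the book. The attached file is not reproduced, so this is not the submitted implementation.

```r
# Classify a first-two-digits MAD value into conformity ranges.
# Thresholds are the cut-offs commonly quoted from Nigrini (2012);
# treat them as an assumption and verify them against the book.
mad_conformity <- function(mad) {
  cut(mad,
      breaks = c(0, 0.0012, 0.0018, 0.0022, Inf),
      labels = c("Close conformity", "Acceptable conformity",
                 "Marginally acceptable conformity", "Nonconformity"),
      include.lowest = TRUE)
}

digits   <- 10:99
expected <- log10(1 + 1 / digits)            # Benford proportions, first two digits
observed <- expected                         # toy data: perfect conformity ...
observed[digits == 50] <- observed[digits == 50] + 0.03   # ... plus a spike at 50
observed <- observed / sum(observed)         # renormalise to proportions

mad <- mean(abs(observed - expected))        # mean absolute deviation
as.character(mad_conformity(mad))
as.character(mad_conformity(0.002336614))    # MAD from the README example: "Nonconformity"
```

Unlike the Chi-square statistic, the MAD does not grow with the number of observations, which is why it is less prone to rejecting conformity on very large data sets.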

Improve default plot order and legend placement

It might not be clear to users what the chi-squared difference refers to. Maybe put both plots next to each other and improve the description of the plot.

Also, think about a better default legend placement.
