Giter Site home page Giter Site logo

Write vignette about benford.analysis HOT 4 OPEN

carloscinelli avatar carloscinelli commented on August 23, 2024
Write vignette

from benford.analysis.

Comments (4)

Henry-E avatar Henry-E commented on August 23, 2024

It would be really helpful if you could outline how to interpret the values returned from the print and plot functions

from benford.analysis.

Henry-E avatar Henry-E commented on August 23, 2024

Hey, I made up an rmarkdown document comparing a couple of different distributions that typically follow benfords law (lognormal with high standard deviation, exponential) and some that don't (lognormal with small standard deviation, normal, uniform).

In terms of using the output of the print and plot functions to determine which groups of numbers follow benfords law, I'm having trouble understanding how to interpret them succesfully. Aside from the classic digits distribution chart, it's hard to see what indicators could be used in identifying rogue sets of numbers. I know you say to not focus on the p-values but for many of the distrubtions that I've looked at in practice that appear to follow Benford's law, still have tiny p-values (< 1e5 in many cases).

Do you have any reccomedations or tips for how you use the package. Like yourself, I'm trying to apply this in a central bank.

---
title: "Benford's law, digit analysis"
output: 
  html_document:
    toc: true
    toc_float: true
# params:
#   numberOfDigits: 1
#   columnToGroupBy: NA
#   data: NA
---

```{r setup, echo=FALSE,  warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(dplyr)
library(stringr)
library(benford.analysis)
library(pander)
options(stringsAsFactors = FALSE)

numObs <- 1e4
data <- tbl_df(bind_rows(data.frame(value = runif(numObs), distribution = "uniform"), 
                         data.frame(value = abs(rnorm(numObs)), distribution = "normal"),
                         data.frame(value = rexp(numObs, rate = 1), distribution = "exponential"),
                         data.frame(value = rlnorm(numObs, meanlog = 1, sdlog = 0.6), distribution = "lognormal - narrow"),
                         data.frame(value = rlnorm(numObs, meanlog = 10, sdlog = 100), distribution = "lognormal - wide")))
columnToGroupBy <- c("distribution")

params <- list()
params$numberOfDigits <- 1
params$columnToGroupBy <- columnToGroupBy
params$data <- data

# we tried to use NULL here before but that leads to an issue
# https://github.com/rstudio/rmarkdown/issues/729
if(params$columnToGroupBy == "NA" || params$data == "NA")
  stop("missing column name(s) or data")

if(ncol(params$data) < 2)
  stop("not enough columns in the data")

areColumnsPresent <- c(params$columnToGroupBy, "value") %in% colnames(params$data)
if(!all(areColumnsPresent))
  stop(str_c(c("columns requested are not present:", params$columnToGroupBy), collapse = "\n"))

dataTypeOfValueColumn <- sapply(params$data, class)["value"]
if(dataTypeOfValueColumn != "integer" && dataTypeOfValueColumn != "numeric")
  stop(str_c("Data type of the value column should be integer or numeric, instead it is: ", dataTypeOfValueColumn))

if(params$numberOfDigits > 2)
  stop(str_c("checking that number of digits is probably not a great idea ", params$numberOfDigits))
​​​​```

```{r benfordAnalysis, results = "asis"}
# the do() function applies the benford function to each group
# and create a list of S3 objects
# https://stat545-ubc.github.io/block023_dplyr-do.html#meet-do
allTheBenfords <- params$data %>%
  group_by_(.dots = params$columnToGroupBy) %>%
  do(benford =  benford(.$value, number.of.digits = params$numberOfDigits))

# order them by the Mean Absolute Deviation, it's not a totally accurate way to do
# it but it's better than nothing. I couldn't figure out how to do it as part
# the dplyr chain, the S3 object subsetting got the better of me. I was able to
# use do and the .(dot) operator to get a column of the values but it deleted all
# the other columns for some reason and I couldn't use the same syntax with mutate
allTheBenfords <- allTheBenfords[order(sapply(allTheBenfords$benford,
                                              function(x) x[["MAD"]]), decreasing = TRUE), ]
# We're testing out ranking by the mantissa arc 
# allTheBenfords <- allTheBenfords[order(sapply(allTheBenfords$benford, 
#                                               function(x) x$stats$mantissa.arc.test$p.value), decreasing = TRUE), ]

for(observation in seq(nrow(allTheBenfords))){

  thisBenfordAnalysis <- allTheBenfords$benford[[observation]]
  cat("\n\n## ", allTheBenfords[observation, params$columnToGroupBy][[1]], "\n\n")

  plot(thisBenfordAnalysis)
  cat("\n")

  # cribbed from the print function for benford.analysis class, which does not
  # print well in rmarkdown
  # https://github.com/carloscinelli/benford.analysis/blob/master/R/functions-new.R
  overallDescription <- data.frame(description =
                                  c("Number of observations used", 
                                     "Number of obs. for second order", 
                                     "First digits analysed"),
                                  value =
                                   c(thisBenfordAnalysis[["info"]]$n, 
                                     thisBenfordAnalysis[["info"]]$n.second.order,
                                     thisBenfordAnalysis[["info"]]$number.of.digits))
  cat("\n#### Overall description of analysis\n")
  cat(pandoc.table(overallDescription))

  cat("\n\n#### Mantissa arc test \n")
  mantissaTable <- thisBenfordAnalysis$mantissa
  mantissaTable$statistic <-  gsub("Mantissa|\\s", " ", mantissaTable$statistic)
  mantissaTable$`expected value` <- c(0.5, 0.08333, -1.2, 0)
  cat(pandoc.table(mantissaTable))
  cat("\n\n#### Numerical deviations")
  how.many <- 5
  cat("\nThe", how.many, "largest deviations: \n\n")
  cat(pandoc.table(round(head(thisBenfordAnalysis[["bfd"]][order(absolute.diff, decreasing=TRUE)][,list(digits, absolute.diff)], how.many), 2)), sep="\n")
  deviationStats <- data.frame(description =
                                 c("Mean Absolute Deviation",
                                   "Distortion Factor"),
                               value =
                                 c(thisBenfordAnalysis[["MAD"]],
                                   thisBenfordAnalysis[["distortion.factor"]]))
  cat(pandoc.table(deviationStats))

  cat("\n\n#### Statistics\n")
  cat("\n\n##### Pearson's Chi-squared test\n")
  chisqStats <- data.frame(description =
                            c("X-squared",
                              "df",
                              "p-value <"),
                           value =
                             c(thisBenfordAnalysis[["stats"]]$chisq$statistic,
                               thisBenfordAnalysis[["stats"]]$chisq$parameter,
                               thisBenfordAnalysis[["stats"]]$chisq$p.value))
  cat(pandoc.table(chisqStats))
  cat("\n\n##### Mantissa Arc Test\n")
  mantissaStats <- data.frame(description = 
                                c("L2",
                                  "df",
                                  "p-value <"),
                              value = 
                                c(thisBenfordAnalysis[["stats"]]$mantissa.arc.test$statistic,
                                  thisBenfordAnalysis[["stats"]]$mantissa.arc.test$parameter,
                                  thisBenfordAnalysis[["stats"]]$mantissa.arc.test$p.value))
  cat(pandoc.table(mantissaStats))
  cat("\n\nRemember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!\n\n")
}
​​​​```

from benford.analysis.

carloscinelli avatar carloscinelli commented on August 23, 2024

Thanks for the comment, Henry! I will take a look at your example and try to explain it better. But just to give a quick answer, the focus here is to have one more indicator (which you will combine with other indicators) to find suspicious data that will need further investigation. You shouldn't use it to get a clear cut "yes/no" answer about the quality of a data set or even a single data point.

from benford.analysis.

vonjd avatar vonjd commented on August 23, 2024

When will the vignette be ready?

from benford.analysis.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.