Write package vignette.

Write vignette about benford.analysis HOT 4 OPEN

carloscinelli commented on August 23, 2024

Write vignette

from benford.analysis.

Comments (4)

Henry-E commented on August 23, 2024

It would be really helpful if you could outline how to interpret the values returned from the print and plot functions

from benford.analysis.

Henry-E commented on August 23, 2024

Hey, I made up an rmarkdown document comparing a couple of different distributions that typically follow benfords law (lognormal with high standard deviation, exponential) and some that don't (lognormal with small standard deviation, normal, uniform).

In terms of using the output of the print and plot functions to determine which groups of numbers follow benfords law, I'm having trouble understanding how to interpret them succesfully. Aside from the classic digits distribution chart, it's hard to see what indicators could be used in identifying rogue sets of numbers. I know you say to not focus on the p-values but for many of the distrubtions that I've looked at in practice that appear to follow Benford's law, still have tiny p-values (< 1e5 in many cases).

Do you have any reccomedations or tips for how you use the package. Like yourself, I'm trying to apply this in a central bank.

---
title: "Benford's law, digit analysis"
output: 
  html_document:
    toc: true
    toc_float: true
# params:
#   numberOfDigits: 1
#   columnToGroupBy: NA
#   data: NA
---

```{r setup, echo=FALSE,  warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(dplyr)
library(stringr)
library(benford.analysis)
library(pander)
options(stringsAsFactors = FALSE)

numObs <- 1e4
data <- tbl_df(bind_rows(data.frame(value = runif(numObs), distribution = "uniform"), 
                         data.frame(value = abs(rnorm(numObs)), distribution = "normal"),
                         data.frame(value = rexp(numObs, rate = 1), distribution = "exponential"),
                         data.frame(value = rlnorm(numObs, meanlog = 1, sdlog = 0.6), distribution = "lognormal - narrow"),
                         data.frame(value = rlnorm(numObs, meanlog = 10, sdlog = 100), distribution = "lognormal - wide")))
columnToGroupBy <- c("distribution")

params <- list()
params$numberOfDigits <- 1
params$columnToGroupBy <- columnToGroupBy
params$data <- data

# we tried to use NULL here before but that leads to an issue
# https://github.com/rstudio/rmarkdown/issues/729
if(params$columnToGroupBy == "NA" || params$data == "NA")
  stop("missing column name(s) or data")

if(ncol(params$data) < 2)
  stop("not enough columns in the data")

areColumnsPresent <- c(params$columnToGroupBy, "value") %in% colnames(params$data)
if(!all(areColumnsPresent))
  stop(str_c(c("columns requested are not present:", params$columnToGroupBy), collapse = "\n"))

dataTypeOfValueColumn <- sapply(params$data, class)["value"]
if(dataTypeOfValueColumn != "integer" && dataTypeOfValueColumn != "numeric")
  stop(str_c("Data type of the value column should be integer or numeric, instead it is: ", dataTypeOfValueColumn))

if(params$numberOfDigits > 2)
  stop(str_c("checking that number of digits is probably not a great idea ", params$numberOfDigits))
```

```{r benfordAnalysis, results = "asis"}
# the do() function applies the benford function to each group
# and create a list of S3 objects
# https://stat545-ubc.github.io/block023_dplyr-do.html#meet-do
allTheBenfords <- params$data %>%
  group_by_(.dots = params$columnToGroupBy) %>%
  do(benford =  benford(.$value, number.of.digits = params$numberOfDigits))

# order them by the Mean Absolute Deviation, it's not a totally accurate way to do
# it but it's better than nothing. I couldn't figure out how to do it as part
# the dplyr chain, the S3 object subsetting got the better of me. I was able to
# use do and the .(dot) operator to get a column of the values but it deleted all
# the other columns for some reason and I couldn't use the same syntax with mutate
allTheBenfords <- allTheBenfords[order(sapply(allTheBenfords$benford,
                                              function(x) x[["MAD"]]), decreasing = TRUE), ]
# We're testing out ranking by the mantissa arc 
# allTheBenfords <- allTheBenfords[order(sapply(allTheBenfords$benford, 
#                                               function(x) x$stats$mantissa.arc.test$p.value), decreasing = TRUE), ]

for(observation in seq(nrow(allTheBenfords))){

  thisBenfordAnalysis <- allTheBenfords$benford[[observation]]
  cat("\n\n## ", allTheBenfords[observation, params$columnToGroupBy][[1]], "\n\n")

  plot(thisBenfordAnalysis)
  cat("\n")

  # cribbed from the print function for benford.analysis class, which does not
  # print well in rmarkdown
  # https://github.com/carloscinelli/benford.analysis/blob/master/R/functions-new.R
  overallDescription <- data.frame(description =
                                  c("Number of observations used", 
                                     "Number of obs. for second order", 
                                     "First digits analysed"),
                                  value =
                                   c(thisBenfordAnalysis[["info"]]$n, 
                                     thisBenfordAnalysis[["info"]]$n.second.order,
                                     thisBenfordAnalysis[["info"]]$number.of.digits))
  cat("\n#### Overall description of analysis\n")
  cat(pandoc.table(overallDescription))

  cat("\n\n#### Mantissa arc test \n")
  mantissaTable <- thisBenfordAnalysis$mantissa
  mantissaTable$statistic <-  gsub("Mantissa|\\s", " ", mantissaTable$statistic)
  mantissaTable$`expected value` <- c(0.5, 0.08333, -1.2, 0)
  cat(pandoc.table(mantissaTable))
  cat("\n\n#### Numerical deviations")
  how.many <- 5
  cat("\nThe", how.many, "largest deviations: \n\n")
  cat(pandoc.table(round(head(thisBenfordAnalysis[["bfd"]][order(absolute.diff, decreasing=TRUE)][,list(digits, absolute.diff)], how.many), 2)), sep="\n")
  deviationStats <- data.frame(description =
                                 c("Mean Absolute Deviation",
                                   "Distortion Factor"),
                               value =
                                 c(thisBenfordAnalysis[["MAD"]],
                                   thisBenfordAnalysis[["distortion.factor"]]))
  cat(pandoc.table(deviationStats))

  cat("\n\n#### Statistics\n")
  cat("\n\n##### Pearson's Chi-squared test\n")
  chisqStats <- data.frame(description =
                            c("X-squared",
                              "df",
                              "p-value <"),
                           value =
                             c(thisBenfordAnalysis[["stats"]]$chisq$statistic,
                               thisBenfordAnalysis[["stats"]]$chisq$parameter,
                               thisBenfordAnalysis[["stats"]]$chisq$p.value))
  cat(pandoc.table(chisqStats))
  cat("\n\n##### Mantissa Arc Test\n")
  mantissaStats <- data.frame(description = 
                                c("L2",
                                  "df",
                                  "p-value <"),
                              value = 
                                c(thisBenfordAnalysis[["stats"]]$mantissa.arc.test$statistic,
                                  thisBenfordAnalysis[["stats"]]$mantissa.arc.test$parameter,
                                  thisBenfordAnalysis[["stats"]]$mantissa.arc.test$p.value))
  cat(pandoc.table(mantissaStats))
  cat("\n\nRemember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!\n\n")
}
```

from benford.analysis.

carloscinelli commented on August 23, 2024

Thanks for the comment, Henry! I will take a look at your example and try to explain it better. But just to give a quick answer, the focus here is to have one more indicator (which you will combine with other indicators) to find suspicious data that will need further investigation. You shouldn't use it to get a clear cut "yes/no" answer about the quality of a data set or even a single data point.

from benford.analysis.

vonjd commented on August 23, 2024

When will the vignette be ready?

from benford.analysis.

Write vignette about benford.analysis HOT 4 OPEN

Comments (4)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent