minty's Introduction

minty

minty (Minimal type guesser) is a package containing the type inferencing and parsing tools (the so-called 1e parsing engine) extracted from readr (with permission, see the issue tidyverse/readr#1517). Since July 2021, readr has no longer used these tools internally for parsing text files. vroom is now used by default, unless the first edition parsing engine is explicitly requested (see the explanation on editions).

readr’s 1e type inferencing and parsing tools are used by various R packages, e.g. readODS and surveytoolbox, for parsing in-memory objects, but those packages do not use the main functions of readr (e.g. readr::read_delim()). As explained in the README of readr, the 1e code will eventually be removed from readr.

minty aims to provide a set of minimal, long-term, and compatible type inferencing and parsing tools for those packages. You might consider minty to be a 1.5e parsing engine.

Installation

You can install the development version of minty like so:

if (!require("remotes")){
    install.packages("remotes")
}
remotes::install_github("gesistsa/minty")

Example

A character-only data.frame

text_only <- data.frame(maybe_age = c("17", "18", "019"),
                        maybe_male = c("true", "false", "true"),
                        maybe_name = c("AA", "BB", "CC"),
                        some_na = c("NA", "Not good", "Bad"),
                        dob = c("2019/07/21", "2019/08/31", "2019/10/01"))
str(text_only)
#> 'data.frame':    3 obs. of  5 variables:
#>  $ maybe_age : chr  "17" "18" "019"
#>  $ maybe_male: chr  "true" "false" "true"
#>  $ maybe_name: chr  "AA" "BB" "CC"
#>  $ some_na   : chr  "NA" "Not good" "Bad"
#>  $ dob       : chr  "2019/07/21" "2019/08/31" "2019/10/01"
## built-in function type.convert:
## except numeric, no type inferencing
str(type.convert(text_only, as.is = TRUE))
#> 'data.frame':    3 obs. of  5 variables:
#>  $ maybe_age : int  17 18 19
#>  $ maybe_male: chr  "true" "false" "true"
#>  $ maybe_name: chr  "AA" "BB" "CC"
#>  $ some_na   : chr  NA "Not good" "Bad"
#>  $ dob       : chr  "2019/07/21" "2019/08/31" "2019/10/01"

Inferencing the column types

library(minty)
data <- type_convert(text_only)
data
#>   maybe_age maybe_male maybe_name  some_na        dob
#> 1        17       TRUE         AA     <NA> 2019-07-21
#> 2        18      FALSE         BB Not good 2019-08-31
#> 3       019       TRUE         CC      Bad 2019-10-01
str(data)
#> 'data.frame':    3 obs. of  5 variables:
#>  $ maybe_age : chr  "17" "18" "019"
#>  $ maybe_male: logi  TRUE FALSE TRUE
#>  $ maybe_name: chr  "AA" "BB" "CC"
#>  $ some_na   : chr  NA "Not good" "Bad"
#>  $ dob       : Date, format: "2019-07-21" "2019-08-31" ...

Type-based parsing tools

parse_datetime("1979-10-14T10:11:12.12345")
#> [1] "1979-10-14 10:11:12 UTC"
fr <- locale("fr")
parse_date("1 janv. 2010", "%d %b %Y", locale = fr)
#> [1] "2010-01-01"
de <- locale("de", decimal_mark = ",")
parse_number("1.697,31", locale = de)
#> [1] 1697.31
parse_number("$1,123,456.00")
#> [1] 1123456
## This is perhaps Python
parse_logical(c("True", "False"))
#> [1]  TRUE FALSE

Type guesser

parse_guess(c("True", "TRUE", "false", "F"))
#> [1]  TRUE  TRUE FALSE FALSE
parse_guess(c("123.45", "1990", "7619.0"))
#> [1]  123.45 1990.00 7619.00
res <- parse_guess(c("2019-07-21", "2019-08-31", "2019-10-01", "IDK"), na = "IDK")
res
#> [1] "2019-07-21" "2019-08-31" "2019-10-01" NA
str(res)
#>  Date[1:4], format: "2019-07-21" "2019-08-31" "2019-10-01" NA

Differences: readr vs minty

Note that, unlike readr and vroom, minty is mainly intended for non-interactive use. It therefore emits fewer messages and warnings than readr and vroom.

data <- minty::type_convert(text_only)
data
#>   maybe_age maybe_male maybe_name  some_na        dob
#> 1        17       TRUE         AA     <NA> 2019-07-21
#> 2        18      FALSE         BB Not good 2019-08-31
#> 3       019       TRUE         CC      Bad 2019-10-01
data <- readr::type_convert(text_only)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   maybe_age = col_character(),
#>   maybe_male = col_logical(),
#>   maybe_name = col_character(),
#>   some_na = col_character(),
#>   dob = col_date(format = "")
#> )
data
#>   maybe_age maybe_male maybe_name  some_na        dob
#> 1        17       TRUE         AA     <NA> 2019-07-21
#> 2        18      FALSE         BB Not good 2019-08-31
#> 3       019       TRUE         CC      Bad 2019-10-01

A verbose option is available if you want those messages; it defaults to FALSE. To keep this package as minimal as possible, these optional messages are printed with base R (not cli).

data <- minty::type_convert(text_only, verbose = TRUE)
#> Column specification:
#> cols(  maybe_age = col_character(),  maybe_male = col_logical(),  maybe_name = col_character(),  some_na = col_character(),  dob = col_date(format = ""))

At the moment, minty does not use the problems mechanism by default.

minty::parse_logical(c("true", "fake", "IDK"), na = "IDK")
#> [1] TRUE   NA   NA
readr::parse_logical(c("true", "fake", "IDK"), na = "IDK")
#> Warning: 1 parsing failure.
#> row col           expected actual
#>   2  -- 1/0/T/F/TRUE/FALSE   fake
#> [1] TRUE   NA   NA
#> attr(,"problems")
#> # A tibble: 1 × 4
#>     row   col expected           actual
#>   <int> <int> <chr>              <chr> 
#> 1     2    NA 1/0/T/F/TRUE/FALSE fake

Some features from vroom have been ported to minty, but not to readr.

## tidyverse/readr#1526
minty::type_convert(data.frame(a = c("NaN", "Inf", "-INF"))) |> str()
#> 'data.frame':    3 obs. of  1 variable:
#>  $ a: num  NaN Inf -Inf
readr::type_convert(data.frame(a = c("NaN", "Inf", "-INF"))) |> str()
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   a = col_character()
#> )
#> 'data.frame':    3 obs. of  1 variable:
#>  $ a: chr  "NaN" "Inf" "-INF"

guess_max is available for parse_guess() and type_convert(). It defaults to NA, which gives the same behavior as readr.

minty::parse_guess(c("1", "2", "drei"))
#> [1] "1"    "2"    "drei"
minty::parse_guess(c("1", "2", "drei"), guess_max = 2)
#> [1]  1  2 NA
readr::parse_guess(c("1", "2", "drei"))
#> [1] "1"    "2"    "drei"

For parse_guess() and type_convert(), trim_ws is considered before type guessing (the expected behavior of vroom::vroom() / readr::read_delim()).

minty::parse_guess(c("   1", " 2 ", " 3  "), trim_ws = TRUE)
#> [1] 1 2 3
readr::parse_guess(c("   1", " 2 ", " 3  "), trim_ws = TRUE)
#> [1] "1" "2" "3"
## tidyverse/readr#1536
minty::type_convert(data.frame(a = "1 ", b = " 2"), trim_ws = TRUE) |> str()
#> 'data.frame':    1 obs. of  2 variables:
#>  $ a: num 1
#>  $ b: num 2
readr::type_convert(data.frame(a = "1 ", b = " 2"), trim_ws = TRUE) |> str()
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   a = col_character(),
#>   b = col_double()
#> )
#> 'data.frame':    1 obs. of  2 variables:
#>  $ a: chr "1"
#>  $ b: num 2

Similar packages

For parsing ambiguous date(time)

Guess column types of a text file

Acknowledgements

Thanks to:

minty's People

Contributors

chainsawriot

minty's Issues

Bug to bug compatibility with `readr::type_convert()`, but is it correct?

Ref #20 tidyverse/readr#1509

text_only <- structure(list(weight = c("4.17"),
                                group = c("ctrl")),
                           class = "data.frame", row.names = c(NA, -1L))

unnamed_cols <- readr::cols(readr::col_skip(), readr::col_guess())
named_cols <- readr::cols(weight = readr::col_skip(), group = readr::col_guess())
partially_named_cols <- readr::cols(weight = readr::col_skip())

readr::type_convert(text_only, unnamed_cols) ## won't skip?
#>   weight group
#> 1   4.17  ctrl
readr::type_convert(text_only, named_cols) ## skip
#>   group
#> 1  ctrl
readr::type_convert(text_only, partially_named_cols) ## skip
#>   group
#> 1  ctrl

minty::type_convert(text_only, unnamed_cols) ## bug-to-bug compatibility?
#>   weight group
#> 1   4.17  ctrl
minty::type_convert(text_only, named_cols) ## skip
#>   group
#> 1  ctrl
minty::type_convert(text_only, partially_named_cols) ## skip
#>   group
#> 1  ctrl

## but it is probably not the expected behavior for the so-called 2e from vroom

readr::read_csv(I("weight,group\n4.17,ctrl\n"), col_names = TRUE, col_types = unnamed_cols)
#> # A tibble: 1 × 1
#>   group
#>   <chr>
#> 1 ctrl
readr::read_csv(I("weight,group\n4.17,ctrl\n"), col_names = TRUE, col_types = named_cols)
#> # A tibble: 1 × 1
#>   group
#>   <chr>
#> 1 ctrl
readr::read_csv(I("weight,group\n4.17,ctrl\n"), col_names = TRUE, col_types = partially_named_cols)
#> # A tibble: 1 × 1
#>   group
#>   <chr>
#> 1 ctrl

Created on 2024-03-19 with reprex v2.1.0

CRAN Submission issues

For the CRAN check, there are two issues that make the incoming checks fail.

  • r-lib/cpp11#355
  • And the word "inferencing" is not in CRAN's dictionary of valid words

These two issues make the submission process very annoying. It's still not on cransay. I can't solve the issue of cpp11. But "inferencing" can be fixed here.

Maybe just call this "Minimal type converter" to reduce the friction.

Prevent namespace conflicts with `readr` and `vroom`

  • Don't export S3 methods (as much as possible)
    • format.col_spec
    • as.character.col_spec
    • all print methods
    • str.col_spec
  • Add readr to Suggests
  • parse_vector() and guess_parser() are confusing for non-interactive usage and should not be exported
  • Can col_spec_standardise still work without those exports?

Implementation of `guess_max`

In many cases we don't need to go through the entire vector to guess the type; compare the guess_max argument of readxl::read_excel() and vroom::vroom().

It seems that we can implement the same thing here:

bool canParse(
    const cpp11::strings& x, const canParseFun& canParse, LocaleInfo* pLocale) {
  for (const auto & i : x) {
    if (i == NA_STRING) {
      continue;
    }
    if (i.size() == 0) {
      continue;
    }
    if (!canParse(std::string(i), pLocale)) {
      return false;
    }
  }
  return true;
}

Instead of going through x entirely (for (const auto & i : x)), just go through a subset of the first guess_max items.
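
The idea above can be sketched in isolation as follows. This is a minimal, self-contained sketch and not minty's actual code: it replaces cpp11::strings with std::vector<std::optional<std::string>> (an empty optional standing in for NA_STRING), drops the LocaleInfo argument, and the names Strings, CanParseFun, canParseFirstN and looksLikeInteger are hypothetical. A negative guess_max is assumed to mean "no limit"; how an R-side NA would be mapped onto that is not shown.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Stand-ins for the cpp11 types (assumptions made for this sketch only):
// an empty optional models NA_STRING.
using Strings = std::vector<std::optional<std::string>>;
using CanParseFun = std::function<bool(const std::string&)>;

// Same loop as canParse(), but it stops after the first guess_max items.
// A negative guess_max is treated as "inspect everything" (assumption).
bool canParseFirstN(const Strings& x, const CanParseFun& canParse, long guess_max) {
  std::size_t n = x.size();
  if (guess_max >= 0 && static_cast<std::size_t>(guess_max) < n) {
    n = static_cast<std::size_t>(guess_max);
  }
  for (std::size_t i = 0; i < n; ++i) {
    if (!x[i].has_value()) {
      continue;  // NA: skipped, but still counts towards guess_max
    }
    if (x[i]->empty()) {
      continue;  // empty string: skipped as in the original loop
    }
    if (!canParse(*x[i])) {
      return false;
    }
  }
  return true;
}

// Toy parser check used only for illustration: non-empty, digits only.
bool looksLikeInteger(const std::string& s) {
  return !s.empty() &&
         std::all_of(s.begin(), s.end(),
                     [](unsigned char c) { return std::isdigit(c) != 0; });
}
```

With the input {"1", "2", "drei"}, canParseFirstN(x, looksLikeInteger, 2) returns true because "drei" is never inspected, while a limit of -1 or 3 returns false. Note that in this sketch NA and empty items count towards the first guess_max items; whether they should is a design choice left open here.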
