enc,krlmlr

Consider as.utf8 method for data frames?

Would re-encode:

names
levels of factors
character vectors
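
The three items above could be sketched with base R alone. The function name `as_utf8_df()` is hypothetical and not part of any package; this is only an illustration of what such a method might do, assuming `enc2utf8()` is an acceptable converter.

```r
# Hypothetical helper (not part of enc or utf8): re-encode a data frame to UTF-8.
as_utf8_df <- function(df) {
  stopifnot(is.data.frame(df))

  # Column names
  names(df) <- enc2utf8(names(df))

  for (i in seq_along(df)) {
    col <- df[[i]]
    if (is.factor(col)) {
      # Levels of factors
      levels(col) <- enc2utf8(levels(col))
      df[[i]] <- col
    } else if (is.character(col)) {
      # Character vectors
      df[[i]] <- enc2utf8(col)
    }
  }

  df
}
```

Other column types are left untouched in this sketch; list columns and nested data frames would need separate handling.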

is_utf8 helper

We need a helper function that checks whether every element of a character vector is UTF-8.

Not sure how to distinguish this from a class-based test for utf8.
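
A minimal base-R sketch of such a check, under one possible definition: an element counts as UTF-8 if it is marked as UTF-8, is pure ASCII, or is NA. The name `is_utf8()` comes from the title above; the definition itself is an assumption, and it tests the content rather than any class attribute.

```r
# Sketch only: TRUE if every element is marked UTF-8, is pure ASCII, or is NA.
is_utf8 <- function(x) {
  stopifnot(is.character(x))

  marked <- Encoding(x) == "UTF-8"

  # ASCII strings are valid UTF-8 regardless of their (unknown) marking.
  ascii <- vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) < 128L),
    logical(1),
    USE.NAMES = FALSE
  )

  all(marked | ascii | is.na(x))
}
```

Note that this trusts the encoding mark and does not validate the bytes themselves.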

Faster version of enc2latin1() and enc2alien()

in C code, borrow from R implementation.

Class that hosts a list of names

properly encoded in the native encoding. Can borrow code from bindr:::check_names().

@hadley: I think dplyr should use this class exclusively as storage for column names of any kind; this means that tibble offers a method to return column names as an object of this class.

utf8 should give advice on testing for encoding problems

i.e. how do you create an encoding problem that fails reliably on all platforms.

I think it's something like:

Make sure you have "Encoding: UTF-8" in your DESCRIPTION (otherwise, use Unicode escapes instead of literal characters).
Use iconv() to convert to Latin-1 encoding. Maybe utf-8 should have a helper method for this case?
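
The two steps above might look like the following in a test file. This is a sketch with illustrative variable names; the Unicode escape keeps the source file itself pure ASCII, so the fixture should not depend on the platform's native encoding.

```r
# Build the same word in two encodings, without literal non-ASCII characters.
utf8_string <- "gl\u00fcck"
latin1_string <- iconv(utf8_string, from = "UTF-8", to = "latin1")

# The encoding marks differ ...
Encoding(utf8_string)
Encoding(latin1_string)

# ... and so do the underlying bytes (two bytes vs. one byte for the umlaut).
charToRaw(utf8_string)
charToRaw(latin1_string)
```

Code under test that ignores the encoding mark and works on raw bytes should then behave differently for the two inputs.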

Maybe we should also provide a testthat helper, expect_utf8()?

Support for UTF-8 encoded language symbols

For tidyverse/dplyr#1950.

Integrate vctrs

so that dplyr can do efficient joins on character vectors.

@romainfrancois: This package implements an S3 class that makes sure that all strings are already UTF-8 encoded. What is the current state of UTF-8 enforcement in dplyr?

Get rid of utf8 class

Seems useless. Rename converter function to as_utf8.

Clarifying life cycle

And adding badge. Reference: r-lib/styler#439

Exception handling

Transforming files with a function that results in an error gives unexpected behaviour.

path1 <- tempfile()
writeLines("x", path1)
path2 <- tempfile()
writeLines("y", path2)
paths <- c(path1, path2)
utf8::transform_lines_enc(paths, function(x) stop(""))
#> Error in if (!any(ret)) { : missing value where TRUE/FALSE needed
#> In addition: Warning messages:
#> 1: When processing 
#> /var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T//RtmptsmbmE/file231c7396370c: this is a stop 
#> 2: When processing 
#> /var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T//RtmptsmbmE/file231caf8301d: this is a stop
purrr::walk(paths, unlink)
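
The error message suggests that the per-file result contains NA once a transformation fails. One way this could be handled is sketched below. `transform_lines_safely()` is a hypothetical stand-in written with base R only, not the package's implementation.

```r
# Hypothetical stand-in: apply `fun` to the lines of each file, tolerating errors.
# Returns TRUE (changed), FALSE (unchanged) or NA (failed) per file.
transform_lines_safely <- function(paths, fun) {
  changed <- vapply(paths, function(path) {
    tryCatch(
      {
        old <- readLines(path, encoding = "UTF-8", warn = FALSE)
        new <- fun(old)
        is_changed <- !identical(old, new)
        if (is_changed) {
          writeLines(enc2utf8(new), path, useBytes = TRUE)
        }
        is_changed
      },
      error = function(e) {
        warning("When processing ", path, ": ", conditionMessage(e), call. = FALSE)
        NA
      }
    )
  }, logical(1))

  # na.rm = TRUE avoids "missing value where TRUE/FALSE needed".
  if (!any(changed, na.rm = TRUE)) {
    message("No files changed.")
  }

  invisible(changed)
}
```

With this shape, a failing `fun` produces one warning per file and an NA entry in the result instead of a secondary error.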

utf8 class also needs to re-encode names

On creation and when modified with names<-
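
A rough S3 sketch of both cases follows. The constructor name `new_utf8()` is hypothetical and the class layout is an assumption; the actual class may be organized differently.

```r
# Hypothetical constructor: re-encode values and names on creation.
new_utf8 <- function(x) {
  nms <- names(x)
  out <- enc2utf8(as.character(x))
  if (!is.null(nms)) {
    names(out) <- enc2utf8(nms)
  }
  structure(out, class = "utf8")
}

# Replacement method: re-encode names whenever they are modified.
`names<-.utf8` <- function(x, value) {
  cls <- oldClass(x)
  x <- unclass(x)
  names(x) <- if (is.null(value)) NULL else enc2utf8(as.character(value))
  structure(x, class = cls)
}
```

Because `names<-` is an internal generic, assigning `names(x) <- value` on an object of this class dispatches to the method above.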

Consider using `LEVELS` instead of accessing sxpinfo directly

The following code from encoding.c could use the LEVELS macro.

# define IS_BYTES(x) ((x)->sxpinfo.gp & BYTES_MASK)
# define IS_LATIN1(x) ((x)->sxpinfo.gp & LATIN1_MASK)
# define IS_ASCII(x) ((x)->sxpinfo.gp & ASCII_MASK)
# define IS_UTF8(x) ((x)->sxpinfo.gp & UTF8_MASK)
# define ENC_KNOWN(x) ((x)->sxpinfo.gp & (LATIN1_MASK | UTF8_MASK))

I believe this is the right approach: it works even if USE_RINTERNALS is not defined (and if it is defined, it compiles to the same machine code as before). The "R Internals" document, the only documentation I could find for LEVELS, in my opinion clearly states that this is the contract of LEVELS:

The bits can be accessed and set by the LEVELS and SETLEVELS macros, which names appear
to date back to the internal factor and ordered types and are now used in only a few places in
the code. The gp field is serialized/unserialized for the SEXPTYPEs other than NILSXP, SYMSXP
and ENVSXP.

Moreover, even data.table uses this macro for the same purpose.

C implementation

of reading and writing.

Release enc 0.2.1

Prepare for release:

Submit to CRAN:

usethis::use_version('patch')
Update cran-comments.md
devtools::submit_cran()
Approve email

Wait for CRAN...

Accepted 🎉
usethis::use_github_release()
usethis::use_dev_version()
Tweet

encoding is property of individual string, not entire vector
always use utf-8 encoding because ...
prefer packages that return utf-8 encoded values
prefer stringi/stringr over base functions
mention particularly problematic base functions (paste? gsub?)
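
The first point in the list above can be illustrated with a short base-R sketch: a single character vector can hold elements with different encoding marks, and `enc2utf8()` normalizes them element by element.

```r
# One vector, three elements, three different encoding marks.
mixed <- c(
  "gl\u00fcck",                                        # marked UTF-8
  iconv("gl\u00fcck", from = "UTF-8", to = "latin1"),  # marked latin1
  "luck"                                               # ASCII, mark is "unknown"
)
Encoding(mixed)
#> [1] "UTF-8"   "latin1"  "unknown"

# After conversion, every non-ASCII element is UTF-8.
Encoding(enc2utf8(mixed))
#> [1] "UTF-8"   "UTF-8"   "unknown"
```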

Umlauts deleted

Any idea why the following happens on my Windows machine?

tempfile <- tempfile()
text <- "glück"
library(enc)
write_lines_enc(text, tempfile)
read_lines_enc(tempfile)
#> [1] ""

Created on 2018-06-15 by the reprex package (v0.2.0).

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Switzerland.1252     
#>  tz       Europe/Berlin               
#>  date     2018-06-15
#> Packages -----------------------------------------------------------------
#>  package   * version date       source                     
#>  backports   1.1.2   2017-12-13 CRAN (R 3.5.0)             
#>  base      * 3.5.0   2018-04-23 local                      
#>  compiler    3.5.0   2018-04-23 local                      
#>  datasets  * 3.5.0   2018-04-23 local                      
#>  devtools    1.13.5  2018-02-18 CRAN (R 3.5.0)             
#>  digest      0.6.15  2018-01-28 CRAN (R 3.5.0)             
#>  enc       * 0.1-10  2018-06-15 Github (krlmlr/enc@6b0171d)
#>  evaluate    0.10.1  2017-06-24 CRAN (R 3.5.0)             
#>  graphics  * 3.5.0   2018-04-23 local                      
#>  grDevices * 3.5.0   2018-04-23 local                      
#>  htmltools   0.3.6   2017-04-28 CRAN (R 3.5.0)             
#>  knitr       1.20    2018-02-20 CRAN (R 3.5.0)             
#>  magrittr    1.5     2014-11-22 CRAN (R 3.5.0)             
#>  memoise     1.1.0   2017-04-21 CRAN (R 3.5.0)             
#>  methods   * 3.5.0   2018-04-23 local                      
#>  Rcpp        0.12.17 2018-05-18 CRAN (R 3.5.0)             
#>  rmarkdown   1.9     2018-03-01 CRAN (R 3.5.0)             
#>  rprojroot   1.3-2   2018-01-03 CRAN (R 3.5.0)             
#>  stats     * 3.5.0   2018-04-23 local                      
#>  stringi     1.1.7   2018-03-12 CRAN (R 3.5.0)             
#>  stringr     1.3.1   2018-05-10 CRAN (R 3.5.0)             
#>  tools       3.5.0   2018-04-23 local                      
#>  utils     * 3.5.0   2018-04-23 local                      
#>  withr       2.1.2   2018-03-15 CRAN (R 3.5.0)             
#>  yaml        2.1.19  2018-05-01 CRAN (R 3.5.0)
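
For comparison with the round trip in the report above, the same write and read can be done in base R. This sketch is not a diagnosis of the reported behaviour. It assumes only base functions and uses a Unicode escape instead of the literal umlaut, so the script does not depend on the encoding of the source file.

```r
path <- tempfile()
text <- "gl\u00fcck"

# Write the UTF-8 bytes unchanged; binary mode avoids line-ending translation.
con <- file(path, open = "wb")
writeLines(enc2utf8(text), con, useBytes = TRUE)
close(con)

# Read back and mark the result as UTF-8.
back <- readLines(path, encoding = "UTF-8", warn = FALSE)
identical(back, text)

unlink(path)
```

If this round trip works on the affected machine while the reprex does not, that would point at the package's read or write path rather than at the session's locale.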