enc's Introduction

enc

Lifecycle: experimental

Portable tools for UTF-8 character data

R and character encoding

The character encoding of a text determines how its letters, digits, and other codepoints (the atomic components of a text) are translated into a sequence of bytes. A byte sequence may translate into valid text in one character encoding, but give nonsense in other character encodings.
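As a small illustration of the last point, the following base R sketch declares the same two bytes once as UTF-8 and once as latin1. It is not taken from the enc package, and the variable names are made up for this example.

```r
# The two bytes 0xC3 0xA4 are the UTF-8 representation of "ä".
bytes <- as.raw(c(0xc3, 0xa4))

# Declare the same bytes as UTF-8 and as latin1.
read_as_utf8 <- rawToChar(bytes)
Encoding(read_as_utf8) <- "UTF-8"
read_as_latin1 <- rawToChar(bytes)
Encoding(read_as_latin1) <- "latin1"

# Code points obtained after interpreting the bytes:
utf8ToInt(enc2utf8(read_as_utf8))    # one code point, U+00E4 ("ä")
utf8ToInt(enc2utf8(read_as_latin1))  # two code points, U+00C3 and U+00A4 ("Ã¤")
```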

For historic reasons, R can store strings in different ways:

  1. in the “native” encoding, the default encoding of the operating system
  2. in UTF-8, the most prevalent and versatile encoding nowadays
  3. in “latin1”, a popular encoding in Western Europe
  4. as “bytes”, leaving the interpretation to the user

On OS X and Linux, the “native” encoding is often UTF-8, but on Windows it is not. To add to the confusion, the encoding is a property of individual strings in a character vector, and not of the entire vector.
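The per-string nature of the encoding can be checked with base R's Encoding(). The vector below is an illustrative sketch that mixes three differently marked strings; it does not use the enc package.

```r
x <- c(
  "a",                                            # ASCII only, never marked
  "\u00e4",                                       # Unicode escape, marked as UTF-8
  iconv("\u00e4", from = "UTF-8", to = "latin1")  # same letter, marked as latin1
)

# Each element carries its own encoding mark.
Encoding(x)
#> [1] "unknown" "UTF-8"   "latin1"

# R translates before comparing, so the two "ä" still compare equal.
x[2] == x[3]
#> [1] TRUE
```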

Why UTF-8?

When working with text, it is advisable to use UTF-8, because it allows encoding virtually any text, even in foreign languages that contain symbols that cannot be represented in your system’s native encoding. The UTF-8 encoding possesses several nice technical properties, and is by far the predominant encoding on the Web. Standardization on a “universal” encoding facilitates data exchange.

Because of R’s special handling of strings, some care must be taken to make sure that you’re actually using the UTF-8 encoding. Many functions in R will hide encoding issues from you, and transparently convert to UTF-8 as necessary. However, some functions (such as those for reading and writing files) will stubbornly prefer the native encoding.
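For file I/O, one way to sidestep the native encoding in base R is to write the UTF-8 bytes unchanged and to declare the encoding when reading. The following is a minimal sketch under that assumption; it is not how enc implements its own helpers.

```r
path <- tempfile(fileext = ".txt")
text <- "gl\u00fcck"

# Open in binary mode so that line endings stay LF on every platform,
# and write the UTF-8 bytes without re-encoding them.
con <- file(path, open = "wb")
writeLines(enc2utf8(text), con, sep = "\n", useBytes = TRUE)
close(con)

# Read the lines back and mark them as UTF-8 instead of the native encoding.
back <- readLines(path, encoding = "UTF-8")
identical(back, text)

unlink(path)
```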

The enc package provides helpers for converting all textual components of an object to UTF-8, and for reading and writing files in UTF-8 (with a LF end-of-line terminator by default). It also defines an S3 class for tagging all-UTF-8 character vectors and ensuring that updates maintain the UTF-8 encoding. Examples for other packages that use UTF-8 by default are:

Example

library(enc)
utf8(c("a", "ä"))
#> [1] "a" "ä"
as_utf8(1)
#> [1] "1"

a <- utf8("ä")
a[2] <- "ö"
class(a)
#> [1] "utf8"

data.frame(abc = letters[1:3], utf8 = utf8(letters[1:3]))
#>   abc utf8
#> 1   a    a
#> 2   b    b
#> 3   c    c

Install the package from GitHub:

# install.packages("devtools")
devtools::install_github("krlmlr/enc")

enc's People

Contributors

dpprdan, indrajeetpatil, krlmlr, lorenzwalthert, yutannihilation

enc's Issues

is_utf8 helper

We need some helper function that checks if a character vector is all utf-8.

Not sure how to distinguish this from a class based test for utf8.
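One possible shape for such a helper, as a rough sketch only: it inspects the encoding mark of each string rather than the class of the vector, and treats pure ASCII strings as acceptable because they are valid UTF-8. The name is_all_utf8() is hypothetical and not part of the package.

```r
# Hypothetical helper: TRUE if every string is marked as UTF-8 or is pure ASCII.
is_all_utf8 <- function(x) {
  stopifnot(is.character(x))
  marked <- Encoding(x) == "UTF-8"
  # A string is ASCII if all of its bytes are below 0x80.
  ascii <- vapply(
    x,
    function(s) all(charToRaw(s) < as.raw(0x80)),
    logical(1),
    USE.NAMES = FALSE
  )
  all(marked | ascii)
}

is_all_utf8(c("a", "\u00e4"))
is_all_utf8(iconv("\u00e4", from = "UTF-8", to = "latin1"))
```

Unlike a class-based test, this check looks at the data itself, so it also works for plain character vectors that were never tagged.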

Class that hosts a list of names

properly encoded in the native encoding. Can borrow code from bindr:::check_names().

@hadley: I think dplyr should use this class exclusively as storage for column names of any kind; this means that tibble offers a method to return column names as an object of this class.

utf8 should give advice on testing for encoding problems

i.e. how do you create an encoding problem that fails reliably on all platforms.

I think it's something like:

  • make sure you have "Encoding: UTF-8" in your DESCRIPTION (otherwise, use unicode escapes instead of literal characters)
  • use iconv to convert to latin-1 encoding. Maybe utf-8 should have a helper method for this case?

Maybe we should also provide a testthat helper, expect_utf8()?
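The two bullet points above could be combined into a test fixture along these lines. This is a sketch using base R only; the variable names are illustrative.

```r
# Unicode escapes keep the source file ASCII-only, so the fixture does not
# depend on the encoding of the file or on the platform.
utf8_input <- "gl\u00fcck"
latin1_input <- iconv(utf8_input, from = "UTF-8", to = "latin1")

# Same text ...
utf8_input == latin1_input

# ... but different bytes, which code working on raw bytes will notice.
nchar(utf8_input, type = "bytes")    # 6: "ü" takes two bytes in UTF-8
nchar(latin1_input, type = "bytes")  # 5: "ü" takes one byte in latin1
```

Feeding both inputs to the function under test and comparing the results should expose code that silently assumes one particular encoding.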

Integrate vctrs

so that dplyr can do efficient joins on character vectors.

@romainfrancois: This package implements an S3 class that makes sure that all strings are already UTF-8 encoded. What is the current state of UTF-8 enforcement in dplyr?

Exception handling

Transforming files with a function that results in an error gives unexpected behaviour.

path1 <- tempfile()
writeLines("x", path1)
path2 <- tempfile()
writeLines("y", path2)
paths <- c(path1, path2)
utf8::transform_lines_enc(paths, function(x) stop(""))
#> Error in if (!any(ret)) { : missing value where TRUE/FALSE needed
#> In addition: Warning messages:
#> 1: When processing 
#> /var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T//RtmptsmbmE/file231c7396370c: this is a stop 
#> 2: When processing 
#> /var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T//RtmptsmbmE/file231caf8301d: this is a stop

purrr::walk(paths, unlink)
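The error message suggests that the per-file result contains NA for files whose transformation failed, so that any() returns NA. The following sketch reproduces that situation with a hypothetical helper, transform_one(), which is an assumption made for illustration and not the package's actual code.

```r
# Hypothetical helper: apply `fun` to the lines of one file.
# Returns TRUE if the file changed, FALSE if not, NA if `fun` failed.
transform_one <- function(path, fun) {
  tryCatch(
    {
      old <- readLines(path, encoding = "UTF-8")
      new <- fun(old)
      changed <- !identical(old, new)
      if (changed) writeLines(enc2utf8(new), path, useBytes = TRUE)
      changed
    },
    error = function(e) {
      warning("When processing ", path, ": ", conditionMessage(e), call. = FALSE)
      NA
    }
  )
}

path1 <- tempfile()
writeLines("x", path1)
path2 <- tempfile()
writeLines("y", path2)
paths <- c(path1, path2)

ret <- suppressWarnings(
  vapply(paths, transform_one, logical(1),
         fun = function(x) stop("boom"), USE.NAMES = FALSE)
)

any(ret)                # NA, so `if (!any(ret))` fails as in the report
any(ret, na.rm = TRUE)  # FALSE, safe to use in `if`

unlink(paths)
```

Under these assumptions, passing na.rm = TRUE (or reporting failed files separately) would avoid the secondary error while keeping the per-file warnings.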

Consider using `LEVELS` instead of accessing sxpinfo directly

The following code from encoding.c could use LEVELS macro.

# define IS_BYTES(x) ((x)->sxpinfo.gp & BYTES_MASK)
# define IS_LATIN1(x) ((x)->sxpinfo.gp & LATIN1_MASK)
# define IS_ASCII(x) ((x)->sxpinfo.gp & ASCII_MASK)
# define IS_UTF8(x) ((x)->sxpinfo.gp & UTF8_MASK)
# define ENC_KNOWN(x) ((x)->sxpinfo.gp & (LATIN1_MASK | UTF8_MASK))

I believe that this is the right way of achieving this, since it works even if you do not define USE_RINTERNALS (but if you do, it compiles to the same machine code as before) and the "R Internals" document, the only documentation I could find for LEVELS, IMO clearly states that this is the contract of LEVELS:

The bits can be accessed and set by the LEVELS and SETLEVELS macros, which names appear to date back to the internal factor and ordered types and are now used in only a few places in the code. The gp field is serialized/unserialized for the SEXPTYPEs other than NILSXP, SYMSXP and ENVSXP.

Moreover, even data.table uses this macro for the same purpose.

Release enc 0.2.1

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version('patch')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

README needs to give more advice

  • encoding is property of individual string, not entire vector
  • always use utf-8 encoding because ...
  • prefer packages that return utf-8 encoded values
  • prefer stringi/stringr over base functions
  • mention particularly problematic base functions (paste? gsub?)

Umlaute deleted

Any idea why on my Windows machine, the following happens?

tempfile <- tempfile()
text <- "glück"
library(enc)
write_lines_enc(text, tempfile)
read_lines_enc(tempfile)
#> [1] ""

Created on 2018-06-15 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Switzerland.1252     
#>  tz       Europe/Berlin               
#>  date     2018-06-15
#> Packages -----------------------------------------------------------------
#>  package   * version date       source                     
#>  backports   1.1.2   2017-12-13 CRAN (R 3.5.0)             
#>  base      * 3.5.0   2018-04-23 local                      
#>  compiler    3.5.0   2018-04-23 local                      
#>  datasets  * 3.5.0   2018-04-23 local                      
#>  devtools    1.13.5  2018-02-18 CRAN (R 3.5.0)             
#>  digest      0.6.15  2018-01-28 CRAN (R 3.5.0)             
#>  enc       * 0.1-10  2018-06-15 Github (krlmlr/enc@6b0171d)
#>  evaluate    0.10.1  2017-06-24 CRAN (R 3.5.0)             
#>  graphics  * 3.5.0   2018-04-23 local                      
#>  grDevices * 3.5.0   2018-04-23 local                      
#>  htmltools   0.3.6   2017-04-28 CRAN (R 3.5.0)             
#>  knitr       1.20    2018-02-20 CRAN (R 3.5.0)             
#>  magrittr    1.5     2014-11-22 CRAN (R 3.5.0)             
#>  memoise     1.1.0   2017-04-21 CRAN (R 3.5.0)             
#>  methods   * 3.5.0   2018-04-23 local                      
#>  Rcpp        0.12.17 2018-05-18 CRAN (R 3.5.0)             
#>  rmarkdown   1.9     2018-03-01 CRAN (R 3.5.0)             
#>  rprojroot   1.3-2   2018-01-03 CRAN (R 3.5.0)             
#>  stats     * 3.5.0   2018-04-23 local                      
#>  stringi     1.1.7   2018-03-12 CRAN (R 3.5.0)             
#>  stringr     1.3.1   2018-05-10 CRAN (R 3.5.0)             
#>  tools       3.5.0   2018-04-23 local                      
#>  utils     * 3.5.0   2018-04-23 local                      
#>  withr       2.1.2   2018-03-15 CRAN (R 3.5.0)             
#>  yaml        2.1.19  2018-05-01 CRAN (R 3.5.0)

Sort keys

Computed lazily, for fast locale-aware sorting with sort(..., method = "radix")

See bottom of http://userguide.icu-project.org/collation

We might be able to define a method for xtfrm(), so that computation and application of sort keys is completely transparent.
