ropensci / spelling

Tools for Spell Checking in R

Home Page: https://docs.ropensci.org/spelling

License: Other

R 100.00%

spell-check spell-checker r rstats spelling spellcheck spellchecker r-package

spelling's Introduction

spelling

Tools for Spell Checking in R

Spell checking common document formats including latex, markdown, manual pages, and description files. Includes utilities to automate checking of documentation and vignettes as a unit test during 'R CMD check'. Both British and American English are supported out of the box and other languages can be added. In addition, packages may define a 'wordlist' to allow custom terminology without having to abuse punctuation.

Spell Check Single Files

The function spell_check_files automatically parses known text formats and only spell checks text blocks, not code chunks.

spell_check_files('README.md', lang = 'en_US')
#   WORD       FOUND IN
# AppVeyor   README.md:5
# CMD        README.md:12
# RStudio    README.md:8

For more information about the underlying spelling engine and how to add support for other languages, see the hunspell package.

Spell Check a Package

Spell check documentation, description, readme, and vignettes of a package:

spell_check_package("~/workspace/V8")
# DESCRIPTION does not contain 'Language' field. Defaulting to 'en-US'.
#   WORD          FOUND IN
# ECMA          V8.Rd:16, description:2,4
# emscripten    description:5
# htmlwidgets   JS.Rd:16
# JSON          V8.Rd:33,38,39,57,58,59,121
# jsonlite      V8.Rd:42
# Ooms          V8.Rd:41,121
# th            description:3
# Xie           JS.Rd:26
# Yihui         JS.Rd:26

Review these words and then update the wordlist to allow them:

update_wordlist("~/workspace/V8")
# The following words will be added to the wordlist:
#  - ECMA
#  - emscripten
#  - htmlwidgets
#  - JSON
#  - jsonlite
#  - Ooms
#  - th
#  - Xie
#  - Yihui
# Are you sure you want to update the wordlist?
# 1: Yes
# 2: No

Then these will no longer be marked as errors:

> spell_check_package("~/workspace/V8")
No spelling errors found.

Automate Package Spell Checking

Use spell_check_setup() to add a unit test to your package which automatically runs a spell check on documentation and vignettes during R CMD check if the environment variable NOT_CRAN is set to TRUE. By default this unit test never fails; it merely prints potential spelling errors to the console.

spell_check_setup("~/workspace/V8")
# Adding 'Language: en-US' to DESCRIPTION
# No changes required to /Users/jeroen/workspace/V8/inst/WORDLIST
# Updated /Users/jeroen/workspace/V8/tests/spelling.R

Note that the NOT_CRAN variable is automatically set to 1 on Travis and in devtools or RStudio, otherwise you need to set it yourself:

export NOT_CRAN=1
R CMD check V8_1.5.9000.tar.gz
# * using log directory ‘/Users/jeroen/workspace/V8.Rcheck’
# * using R version 3.5.1 (2018-07-02)
# * using platform: x86_64-apple-darwin15.6.0 (64-bit)
# ...
# ...
# * checking tests ...
#   Running ‘spelling.R’
#   Comparing ‘spelling.Rout’ to ‘spelling.Rout.save’ ... OK
#   Running ‘testthat.R’
#  OK

spelling's Issues

References count as spelling errors in spell_check_files() with Rmd

When using references in RMarkdown, it looks like they count as spelling errors (also, keys and values in the YAML header show up as errors, but these are pretty easy to filter out and might even be necessary if the title field needs checking). I'd be happy to make a PR on this, but not sure what the approach should be. Maybe a user-defined filter function that can exclude certain regexes from being checked?

biblio_file <- tempfile(fileext = ".bib")
rmd_file <- tempfile(fileext = ".Rmd")

writeLines(
  paste(
    "@article{dunnington16,",
    "  title = {A geochemical perspective on the impact of development at {Alta} {Lake}, {British} {Columbia}, {Canada}},",
    "  volume = {56},",
    "  doi = {10.1007/s10933-016-9919-x},",
    "  number = {4},",
    "  journal = {Journal of Paleolimnology},",
    "  author = {Dunnington, Dewey W. and Spooner, Ian S. and White, Chris E. and Cornett, R. Jack and Williamson, Dave and Nelson, Mike},",
    "  month = nov,",
    "  year = {2016},",
    "  pages = {315-330},",
    "}",
    sep = "\n"
  ),
  biblio_file
)

writeLines(
  paste(
    "---",
    "output: word_document",
    sprintf("bibliography: %s", biblio_file),
    "---",
    "",
    "Everything @dunnington16 says is obviously correct",
    "",
    "Lakes are fantastic [@dunnington16]",
    "",
    "This Dunnington fellow really has things figured out [-@dunnington16]",
    sep = "\n"
  ),
  rmd_file
)

cat(paste(readLines(rmd_file), collapse = "\n"))
#> ---
#> output: word_document
#> bibliography: /var/folders/bq/2rcjstv90nx1_wrt8d3gqw6m0000gn/T//Rtmp95OnIv/file3d521f8518fb.bib
#> ---
#> 
#> Everything @dunnington16 says is obviously correct
#> 
#> Lakes are fantastic [@dunnington16]
#> 
#> This Dunnington fellow really has things figured out [-@dunnington16]
spelling::spell_check_files(rmd_file)
#>   WORD         FOUND IN
#> bq           file3d525b2b16e7.Rmd:2
#> dunnington   file3d525b2b16e7.Rmd:6,8,10
#> Dunnington   file3d525b2b16e7.Rmd:10
#> fb           file3d525b2b16e7.Rmd:2
#> gn           file3d525b2b16e7.Rmd:2
#> gqw          file3d525b2b16e7.Rmd:2
#> nx           file3d525b2b16e7.Rmd:2
#> OnIv         file3d525b2b16e7.Rmd:2
#> rcjstv       file3d525b2b16e7.Rmd:2
#> Rtmp         file3d525b2b16e7.Rmd:2
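
One possible shape for the user-defined filter suggested above, as a minimal base R sketch. `strip_citations()` is a hypothetical helper, not part of spelling, and the regular expressions only cover the simple pandoc citation forms shown in this report.

```r
# Hypothetical pre-filter: remove pandoc citation keys before spell checking.
strip_citations <- function(lines) {
  # Bracketed citations such as [@key], [-@key] or [@a; @b]
  lines <- gsub("\\[[^]]*@[^]]*\\]", "", lines)
  # Bare in-text citations such as @key (not preceded by a letter or digit)
  gsub("(^|[^[:alnum:]])@[[:alnum:]_-]+", "\\1", lines)
}

text <- c(
  "Everything @dunnington16 says is obviously correct",
  "Lakes are fantastic [@dunnington16]",
  "This Dunnington fellow really has things figured out [-@dunnington16]"
)
strip_citations(text)
```

The filtered lines keep their positions, so line numbers in the report would still match the source file.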

contractions are spelling errors?

> spelling::spell_check_text("this package doesn't like contractions like isn't, aren't")
   word found
1  aren     1
2 doesn     1
3   isn     1

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin16.7.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] corpus_0.9.1.9000

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12    roxygen2_6.0.1  lattice_0.20-35 digest_0.6.12  
 [5] crayon_1.3.2    withr_2.0.0     commonmark_1.2  grid_3.4.1     
 [9] R6_2.2.2        magrittr_1.5    stringi_1.1.5   testthat_1.0.2 
[13] xml2_1.1.1      Matrix_1.2-10   devtools_1.13.1 tools_3.4.1    
[17] stringr_1.2.0   hunspell_2.5    compiler_3.4.1  spelling_1.0   
[21] memoise_1.1.0
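
The fragments reported above (aren, doesn, isn) are what a tokeniser produces when it treats the apostrophe as a word boundary. The base R sketch below only illustrates that effect; it is not the parser that spelling or hunspell actually uses.

```r
text <- "this package doesn't like contractions like isn't, aren't"

# Splitting on every non-letter treats the apostrophe as a boundary.
split_naive <- strsplit(text, "[^[:alpha:]]+")[[1]]

# Allowing an apostrophe between letters keeps contractions whole.
split_keep <- regmatches(
  text,
  gregexpr("[[:alpha:]]+('[[:alpha:]]+)*", text)
)[[1]]

split_naive
split_keep
```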

Better yaml frontmatter parsing

In rmarkdown documents, ignore the rmarkdown::yaml_front_matter except for title/subtitle fields.
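
A rough base R sketch of this idea, assuming simple single-line `key: value` entries. `mask_front_matter()` is a hypothetical helper; a real implementation would more likely parse the header properly (for example with rmarkdown::yaml_front_matter, as mentioned above).

```r
# Hypothetical helper: blank out YAML front matter except title/subtitle,
# keeping the line count unchanged so reported line numbers stay valid.
mask_front_matter <- function(lines, keep = c("title", "subtitle")) {
  delims <- which(grepl("^---\\s*$", lines))
  # Only act when the document starts with a front matter block.
  if (length(delims) < 2 || delims[1] != 1) return(lines)
  body <- seq_len(delims[2])
  keep_rx <- paste0("^(", paste(keep, collapse = "|"), "):")
  drop <- body[!grepl(keep_rx, lines[body])]
  lines[drop] <- ""
  lines
}

doc <- c("---", "title: A Study", "output: word_document",
         "bibliography: /tmp/x.bib", "---", "", "Body text")
mask_front_matter(doc)
```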

CRAN spelling: spell_check_files with files in different directories

Hi,

I have a list of files in different directories. When I call spell_check_files(files), the found column of the result contains only the basename of each file and the line number. Would it be possible to return the paths, too?

Sigbert

spell_check_bookdown

Similar to spell_check_package() i.e. would use the WORDLIST.

Language as an argument, NULL by default. If not given, try to find language info in a DESCRIPTION file; otherwise assume US English.

Checks

Rmd
README.md if it exists and if README.Rmd doesn't exist.
NEWS.md if it exists.
DESCRIPTION if it exists (like the package function does)

Or maybe it could be simplified somehow into checking all Rmd, and all md without Rmd source 🤔
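
The simplified rule (all Rmd, plus all md without an Rmd source) could look roughly like this in base R. `bookdown_spell_targets()` is a hypothetical name, and matching md to Rmd by path without extension is an assumption.

```r
# Hypothetical file selection: every .Rmd, plus every .md that has no
# .Rmd source with the same path and base name.
bookdown_spell_targets <- function(dir = ".") {
  rmd <- list.files(dir, pattern = "\\.Rmd$", recursive = TRUE,
                    ignore.case = TRUE)
  md <- list.files(dir, pattern = "\\.md$", recursive = TRUE,
                   ignore.case = TRUE)
  has_source <- tolower(tools::file_path_sans_ext(md)) %in%
    tolower(tools::file_path_sans_ext(rmd))
  sort(c(rmd, md[!has_source]))
}
```

The returned paths are relative to `dir`, so they would need to be combined with `dir` before being handed to a spell checker.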

Documentation enhancement: spell_check_setup

spell_check_setup() should list and explain the changes it makes to DESCRIPTION in its documentation.

Add possibility to add new words to a wordlist from files, not only from package

The only function to update wordlist is update_wordlist and it only works on a package.
My use-case is a blog-post and there I only have one file. Would it be possible to add a similar function just for files?
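
Until such a function exists, one workaround is to keep a plain-text word list next to the post and pass it through the `ignore` argument of spell_check_files(). The sketch below assumes one word per line; `read_wordlist()` and the file names are hypothetical.

```r
# Hypothetical helper: read a plain-text word list, one word per line.
read_wordlist <- function(path) {
  if (!file.exists(path)) return(character())
  words <- trimws(readLines(path, warn = FALSE, encoding = "UTF-8"))
  unique(words[nzchar(words)])
}

# Assumed usage with a blog post (file names are placeholders):
# spelling::spell_check_files("post.Rmd", ignore = read_wordlist("WORDLIST"))
```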

Spellcheck fails in R CMD check due to fixed WORDLIST link

We created a test running:

spell_check_package(pkg, vignettes = TRUE, use_wordlist = TRUE)

When this runs under R CMD check, it fails on words that are clearly in the WORDLIST. This happens for our NEWS.md and DESCRIPTION files, but not for .Rd files.

We noticed that the error disappears after creating inst/inst/WORDLIST. Could the location of the wordlist be made configurable in the call to spell_check_package? We can add an if clause on our side to check the mode (test vs. R CMD check).

Thanks

Spell check Roxygen documentation comments

Is there any way to spell check just the roxygen documentation comments (#') rather than the generated man files? This would make fixing the spelling errors much more convenient. Thanks!
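
As a rough idea of what this could involve, the base R sketch below pulls roxygen comment text out of source lines while remembering the original line numbers. `extract_roxygen()` is a hypothetical helper, and roxygen tags are left in the text unfiltered.

```r
# Hypothetical helper: extract roxygen comment text from R source lines,
# keeping the original line numbers.
extract_roxygen <- function(lines) {
  is_rox <- grepl("^\\s*#'", lines)
  data.frame(
    line = which(is_rox),
    text = sub("^\\s*#' ?", "", lines[is_rox]),
    stringsAsFactors = FALSE
  )
}

src <- c(
  "#' Add two numbres",
  "#'",
  "#' @param x A number.",
  "add <- function(x, y) x + y  # not roxygen"
)
extract_roxygen(src)
```

The `text` column could then be passed to spelling::spell_check_text(), with the `line` column used to map any hits back to the source file.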

Make NEWS pkgdown-compatible?

Changes needed

Use of Markdown syntax
Putting the package name in section names (each section = "spelling versionnumber").

I can make a PR if you're ok with that @jeroen.

Advantage = getting a changelog in the docs website like e.g. https://docs.ropensci.org/codemetar/news/index.html

WISH: Add support for .aspell/defaults.R and .aspell/WORDLIST.rds

Background

Per help("aspell-utils", package = "utils"), it's possible to add custom dictionaries controlled via .aspell/defaults.R, e.g.

Rd_files <- vignettes <- R_files <- description <-
    list(encoding = "UTF-8",
         language = "en",
         dictionaries = c("en_stats", "WORDLIST"))

where WORDLIST refers to .aspell/WORDLIST.rds, which contains a list of accepted words, e.g.

saveRDS(accepted_words, file = ".aspell/WORDLIST.rds", version = 2L)

This can be used to avoid R CMD check --as-cran NOTEs (which often are reported by the win-builder or CRAN Incoming services), e.g. Possibly mis-spelled words in DESCRIPTION: ...

Wish

It would be great if spelling could provide "standards" and functions for setting this up.

Also, maybe spelling could fall back to .aspell/WORDLIST.rds, if inst/WORDLIST is not found.
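
The fallback described in the wish could be sketched like this in base R. `find_wordlist()` is a hypothetical name, and the lookup order (inst/WORDLIST first) is an assumption taken from the sentence above.

```r
# Hypothetical lookup: prefer inst/WORDLIST, fall back to .aspell/WORDLIST.rds.
find_wordlist <- function(pkg = ".") {
  txt <- file.path(pkg, "inst", "WORDLIST")
  rds <- file.path(pkg, ".aspell", "WORDLIST.rds")
  if (file.exists(txt)) {
    readLines(txt, warn = FALSE, encoding = "UTF-8")
  } else if (file.exists(rds)) {
    as.character(readRDS(rds))
  } else {
    character()
  }
}
```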


Error in read_xml.raw

Running spelling::spell_check_test() fails on the report package with the following error:

Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
  PCDATA invalid Char value 27 [9]

It seems like the error comes from the README.md, which indeed contains some special characters like percentages etc.

How can I either skip spell checking that file or skip the problematic characters? Thank you

URLs in Description are treated as spelling errors

URLs specified in the Description field of the DESCRIPTION file should not be spell-checked.

URLs in Description are common (and encouraged) as a way to provide references, and they are enclosed in angle brackets, which should make them easy to detect and exclude from spell checks.
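
Detecting the angle-bracket form is a one-line substitution. The sketch below is a hypothetical pre-filter, not what spelling does, and the example strings are made up.

```r
# Hypothetical pre-filter: drop <http://...> and <https://...> autolinks.
strip_angle_urls <- function(x) {
  gsub("<https?://[^>[:space:]]+>", "", x)
}

strip_angle_urls("See <https://example.org/a-b> for details.")
strip_angle_urls("a < b")  # plain angle brackets are left alone
```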

Partial matching issue with `pkg$author`

spelling/R/spell-check.R

Line 33 in 7f5e3f6

author <- strsplit(pkg$author, " ", fixed = TRUE)[[1]]

will return the Authors@R field if author does not exist, due to partial matching of $. This should be changed to pkg[["author"]], with additional code added to parse Authors@R separately.
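
The partial matching behaviour is easy to reproduce with a plain list. The element names and values below are made up, and the lower-case field name is an assumption about how the DESCRIPTION fields are stored.

```r
# A list standing in for parsed DESCRIPTION fields (values are made up).
pkg <- list(package = "demo", "authors@r" = "person('A', 'B')")

pkg$author        # partial match: silently returns the "authors@r" element
pkg[["author"]]   # exact match: NULL, because there is no "author" element
```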

"Package suggested but not available" during GitHub Action CI

Hello!

When I trigger a CI check with GitHub Action, the R-devel configuration of the standard check (as defined here) fails because it can't find spelling...

* checking package dependencies ... ERROR
##[error]Package suggested but not available: ‘spelling’

See for example here but it happens with other packages.

This behavior is not observed with Travis.

It may be linked with #50

Thanks :)

spell_check_test causes warning inside codecov inside GitHub actions

Hi! :)

I am using a GitHub action from r-lib to run covr on my code. Since I added a spell_check_test (using usethis::use_spell_check), I get a warning in the covr GitHub action log:

files differ in number of lines:
6,8c6
< Warning message:
< In spelling::spell_check_test(error = TRUE) :
<   Failed to find package source directory

I get no warnings when I run covr locally. It also runs without a warning in the R CMD check GitHub action.

Example commit: https://github.com/and3k/dtutils/runs/564517180 (see "Test coverage" step)

Thanks!
Bela

Ignore fancy quotes

I have this in an Rmd:

See ['Configuration'][pkg_config] for details.

which will have fancy quotes in the README.md, courtesy of pandoc I guess:

See [‘Configuration’](TODO) for details.

and then spellcheck reports:

Configuration’ README.md:177

I wonder if it would be easy to ignore the fancy quotes? TBH I am not sure why they are considered to be part of the word.
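
One way to sidestep the problem, sketched in base R, is to map curly quotes to their ASCII equivalents before checking. `normalize_quotes()` is a hypothetical helper, not something spelling provides.

```r
# Hypothetical normaliser: map curly quotes to their ASCII equivalents.
normalize_quotes <- function(x) {
  chartr("\u2018\u2019\u201C\u201D", "''\"\"", x)
}

normalize_quotes("See \u2018Configuration\u2019 for details.")
```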

Exclude specific files from spell check similar to a `.gitignore` file

Similar to the whitelisting WORDLIST file, but excluding an entire file, like a .gitignore or .lintr. My use case is a single foo.Rd file with hundreds of failing words (gene sequence definitions) that are valid in that context; I do not want to bloat the WORDLIST with this idiosyncratic, highly specific set of words unique to foo.Rd.

Function use cases:

spelling::spell_check_package()
spelling::spell_check_files()

Best case scenario is this is already possible and I've simply missed it. Thx!
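
For spell_check_files(), which takes an explicit vector of paths, a similar effect can be had by filtering the paths first. The sketch below assumes glob patterns with `#` comment lines, loosely modelled on a .gitignore; `filter_ignored()` and the patterns are hypothetical.

```r
# Hypothetical ignore-file support: drop paths matching any glob pattern.
filter_ignored <- function(paths, patterns) {
  # Skip empty lines and comment lines.
  patterns <- patterns[nzchar(patterns) & !startsWith(patterns, "#")]
  if (!length(patterns)) return(paths)
  rx <- paste(utils::glob2rx(patterns), collapse = "|")
  paths[!grepl(rx, paths)]
}

paths <- c("man/foo.Rd", "man/bar.Rd", "vignettes/intro.Rmd")
kept <- filter_ignored(paths, c("# gene sequences", "man/foo.Rd", "vignettes/*"))
kept
```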

spell_check_files: FOUND IN file name is mixed up when path isn't sorted

I put MistakeA in fileA.txt and MistakeB in fileB.txt. When

path <- c("fileB.txt", "fileA.txt")

then spell_check_files(path) says MistakeA is found in fileB.txt and vice versa

library(spelling)

Create `fileA.txt` and `fileB.txt`

fileA <- '
store
car
MistakeA
road
'

fileB <- '
store
MistakeB
desk
road
'

writeLines(fileA, con = "fileA.txt")
writeLines(fileB, con = "fileB.txt")


files <- c("fileB.txt", "fileA.txt")

Check each file separately - `FOUND IN` is correct

l1 <- lapply(files, spell_check_files)
do.call("rbind", l1)
#>   WORD       FOUND IN
#> MistakeB   fileB.txt:3
#> MistakeA   fileA.txt:4

Check in one run - `FOUND IN` is mixed up

spell_check_files(files)
#>   WORD       FOUND IN
#> MistakeA   fileB.txt:4
#> MistakeB   fileA.txt:3

Sort then check in one run - `FOUND IN` is correct

spell_check_files(sort(files))
#>   WORD       FOUND IN
#> MistakeA   fileA.txt:4
#> MistakeB   fileB.txt:3

Created on 2019-02-02 by the reprex package (v0.2.1)

spell_check_package returns error in read_xml.raw

This might be related to #6; I've been unable to troubleshoot the issue on my end. The following spelling.R file is set up to spell check my package:

if(requireNamespace('spelling', quietly = TRUE))
  spelling::spell_check_test(error = FALSE,
                             skip_on_cran = TRUE)

When I run devtools::check(args = c('--as-cran')):

    Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
       PCDATA invalid Char value 27 [9]
     Calls: <Anonymous> ... xml_find_all -> <Anonymous> -> read_xml.character -> read_xml.raw
     Execution halted

This isn't too informative. I don't know if spelling checks the html vignettes, but that was my initial thought. So I tried:

if(requireNamespace('spelling', quietly = TRUE))
  spelling::spell_check_test(vignettes = FALSE, error = FALSE,
                             skip_on_cran = TRUE)

And the check is successful.

My final check was to run spelling::spell_check_files() on the Rmd, html, and .r vignette files. These printed spelling errors but did not fail like above.

My big question is how do I troubleshoot this message? For reference the package is tbrf and the test failures are shown: https://github.com/mps9506/tbrf/runs/686572191?check_suite_focus=true#step:10:152

Include README.md/Rmd in spell_check_package()

I suggest that one include the README in spell_check_package(). Another option would be to find all Rmd or md files in the package, including ones outside vignettes/, such as ones that may be in inst.

spell_check_package() to include NEWS and ChangeLog too

spell_check_package() checks NEWS.md but not plain-text NEWS files. Please consider adding that. It's probably as simple as adding NEWS to the list of files recognized.

Links get treated as text in commonmark 1.9.0

For example:

cat(commonmark::markdown_xml('A link: https://crandb.r-pkg.org is good', extensions = TRUE, sourcepos = TRUE))

now gives:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-1:40" xmlns="http://commonmark.org/xml/1.0">
  <paragraph sourcepos="1:1-1:40">
    <text sourcepos="1:1-1:13" xml:space="preserve">A link: </text>
    <link sourcepos="1:8-1:32" destination="https://crandb.r-pkg.org" title="">
      <text sourcepos="1:8-1:32" xml:space="preserve">https://crandb.r-pkg.org</text>
    </link>
    <text sourcepos="1:33-1:40" xml:space="preserve"> is good</text>
  </paragraph>
</document>

Whereas in previous versions of commonmark it would give:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-1:40" xmlns="http://commonmark.org/xml/1.0">
  <paragraph sourcepos="1:1-1:40">
    <text sourcepos="1:1-1:13" xml:space="preserve">A link: </text>
    <link destination="https://crandb.r-pkg.org" title="">
      <text xml:space="preserve">https://crandb.r-pkg.org</text>
    </link>
    <text sourcepos="1:33-1:40" xml:space="preserve"> is good</text>
  </paragraph>
</document>

Note how <link> has gained a sourcepos attribute. I think this is what causes them to be spellchecked.

@gaborcsardi

Spell check templates in package

I have some Rmarkdown templates in inst/templates in my package.

Is there a way to get spelling::spell_check_package() to also check these files?

urls are treated as spelling errors

urls in markdown (md) documents are spell-checked in spell_check_files() and spell_check_package() as long as they have angle brackets around them. This is inconsistent with the behavior of links without angle brackets, which are not spellchecked.

Here is a reprex where I create a markdown file from an rmarkdown file using rmarkdown::github_document().

writeLines(con = "test.Rmd", text = "
---
output: github_document
---
 
https://github.com/ropensci/spelling/issues/21
")

rmarkdown::render("test.Rmd", quiet = TRUE)

cat(readLines("test.md"), sep = "\n")
#> 
#> <https://github.com/ropensci/spelling/issues/21>

spelling::spell_check_files(c("test.md", "test.Rmd"))
#>   WORD       FOUND IN
#> github     test.md:2
#> https      test.md:2
#> ropensci   test.md:2

As we can see, only the urls in the .md file are spellchecked. If the angle brackets get removed, all spell checks pass.

library(magrittr)
readLines("test.md") %>% sub("<", "", .) %>% 
  sub(">", "", .) %>% writeLines("test.md")

spelling::spell_check_files(c("test.md", "test.Rmd"))
#> No spelling errors found.

It would be very handy if this could be fixed in the spelling package, since I am using rmarkdown::github_document() for most of my R packages and I don't see an elegant way to run spell_check_package() without getting spellcheck-warnings because of this behavior.

knitr:::file_ext

FYI I have removed the unexported function file_ext from knitr: yihui/knitr@cd6bed6 This function has been moved to the xfun package.

Diacritics and spelling errors

If I have words with accented characters, e.g. moiré (as in "moiré vibrations"), then running

spelling::spell_check_files(path = list.files(pattern = ".Rmd"),
                            lang = 'en-US')

marks this as a spelling error: moirÃ.

Any possibility of avoiding this?
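
The reported moirÃ looks like UTF-8 bytes being interpreted as Latin-1 somewhere along the way. That is a likely cause rather than a confirmed one; the base R sketch below only reproduces the symptom and shows that declaring the encoding on read avoids it.

```r
x <- "moir\u00e9"  # the word from the report, stored as UTF-8

# Reinterpret the UTF-8 bytes as Latin-1, which is how mojibake arises.
misread <- rawToChar(charToRaw(enc2utf8(x)))
Encoding(misread) <- "latin1"
misread <- enc2utf8(misread)  # six characters instead of five

# Declaring the encoding when reading keeps the accented letter intact.
tmp <- tempfile(fileext = ".Rmd")
writeLines(enc2utf8(x), tmp, useBytes = TRUE)
ok <- readLines(tmp, encoding = "UTF-8", warn = FALSE)
```

The trailing character of the misread word is not a letter, which would explain why only moirÃ shows up in the report.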

include .Rd files in spell_check_files

Hi,
is it possible to support .Rd files in spell_check_files? Currently, Rd files are checked as plain text, so code objects are also spell checked.

I wonder if spell_check_file_one could have an additional condition:

if (grepl("\\.rd$", path, ignore.case = TRUE)) {
  spell_check_file_rd(...)
}

Request for next release of {spelling}

Hello! I use the lifecycle package to document functions, and spelling gives a verbose warning that the lifecycle macro is not defined. This was documented in #42 and fixed last October in pull request #44. Would it be possible to make a release so that we don't see the warnings when using the CRAN version of spelling? The issue comes up when I work with staff inexperienced in package development: the warnings throw them off and they can't figure out what is going on.

Many Thanks!

spellcheck fails with unhelpful error

spell_check_setup()
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  PCDATA invalid Char value 8 [9]

This occurred within the R package: https://github.com/mobiodiv/mobsim

spell_check_blogdown

spell_check_package() now checks vignettes/pkgdown subfolder

In the quanteda package we have some non-English language vignettes but only in the vignettes/pkgdown folder. Previously (last week??), spell_check_package() did not check those files, but now it does, so we get a very long list of Chinese, Japanese, Spanish etc words that fail to match the English dictionary or WORDLIST.

Does not fail on Travis by the way, only locally.

Is there a way to exclude this subfolder from checking?

Enhancement request: sort output of spell_check_package() by file

A reasonable workflow is to open files one after another and fix misspellings. Thus it would be useful to have a parameter by_file (logical, default FALSE) to sort the output by files, not alphabetically by misspelled word.

(I can submit a patch if that seems useful.)
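
A sketch of what a by_file switch might do, using a simplified long-format data frame (one row per occurrence) rather than the actual return value of spell_check_package(). The rows are taken from the lifecycle output shown elsewhere on this page; `sort_hits()` is hypothetical.

```r
# Simplified results in long format: one row per word/file/line occurrence.
hits <- data.frame(
  word = c("testthat", "backtrace", "behaviour", "backtrace"),
  file = c("verbosity.Rd", "NEWS.md", "lifecycle.Rmd", "last_warnings.Rd"),
  line = c(11L, 13L, 38L, 14L),
  stringsAsFactors = FALSE
)

# Hypothetical by_file switch: order by file and line instead of by word.
sort_hits <- function(hits, by_file = FALSE) {
  ord <- if (by_file) {
    order(hits$file, hits$line)
  } else {
    order(hits$word, hits$file)
  }
  hits[ord, , drop = FALSE]
}

sort_hits(hits, by_file = TRUE)
```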

Spellcheck fails with uninformative error

> spelling::spell_check_package(pkg = ".", vignettes = TRUE, lang = "en_US")
Error in hunspell::dictionary(lang, add_words = sort(ignore)) : 
  unused argument (add_words = sort(ignore))

Use rstudio dictionaries

Supposed to have common R jargon:

Dictionaries: https://s3.amazonaws.com/rstudio-buildtools/dictionaries/core-dictionaries.zip

From the script here: https://github.com/rstudio/rstudio/blob/master/dependencies/common/install-dictionaries

Warnings when spell-checking a package with Rd macros

The lifecycle package does define the \lifecycle macro, but spelling warns that it is not defined.

git2r::clone(
  url = "https://github.com/r-lib/lifecycle",
  local_path = "lifecycle"
)
#> cloning into 'lifecycle'...
#> Receiving objects:   1% (9/880),   13 kb
#> Receiving objects:  11% (97/880),   30 kb
#> Receiving objects:  21% (185/880),   94 kb
#> Receiving objects:  31% (273/880),  111 kb
#> Receiving objects:  41% (361/880),  134 kb
#> Receiving objects:  51% (449/880),  150 kb
#> Receiving objects:  61% (537/880),  166 kb
#> Receiving objects:  71% (625/880),  166 kb
#> Receiving objects:  81% (713/880),  183 kb
#> Receiving objects:  91% (801/880),  191 kb
#> Receiving objects: 100% (880/880),  239 kb, done.
#> Local:    master /tmp/RtmpRWoEuz/reprex85671d073af/lifecycle
#> Remote:   master @ origin (https://github.com/r-lib/lifecycle)
#> Head:     [445f7f6] 2019-08-09: Add `is_present()` (#15)
spelling::spell_check_package("lifecycle")
#> DESCRIPTION does not contain 'Language' field. Defaulting to 'en-US'.
#> Warning in parse_Rd(ifile, encoding = encoding, macros = macros): /tmp/
#> RtmpRWoEuz/reprex85671d073af/lifecycle/man/badge.Rd:47: unknown macro
#> '\lifecycle'
#> Warning in parse_Rd(ifile, encoding = encoding, macros = macros): /tmp/
#> RtmpRWoEuz/reprex85671d073af/lifecycle/man/badge.Rd:48: unknown macro
#> '\lifecycle'
#> Warning in parse_Rd(ifile, encoding = encoding, macros = macros): /tmp/
#> RtmpRWoEuz/reprex85671d073af/lifecycle/man/badge.Rd:49: unknown macro
#> '\lifecycle'
#> Warning in parse_Rd(ifile, encoding = encoding, macros = macros): /tmp/
#> RtmpRWoEuz/reprex85671d073af/lifecycle/man/badge.Rd:50: unknown macro
#> '\lifecycle'
#> Warning in parse_Rd(ifile, encoding = encoding, macros = macros): /tmp/
#> RtmpRWoEuz/reprex85671d073af/lifecycle/man/badge.Rd:51: unknown macro
#> '\lifecycle'
#> Warning in parse_Rd(ifile, encoding = encoding, macros = macros): /tmp/
#> RtmpRWoEuz/reprex85671d073af/lifecycle/man/badge.Rd:52: unknown macro
#> '\lifecycle'
#> Warning in parse_Rd(ifile, encoding = encoding, macros = macros): /tmp/
#> RtmpRWoEuz/reprex85671d073af/lifecycle/man/badge.Rd:53: unknown macro
#> '\lifecycle'
#> Warning in parse_Rd(ifile, encoding = encoding, macros = macros): /tmp/
#> RtmpRWoEuz/reprex85671d073af/lifecycle/man/badge.Rd:54: unknown macro
#> '\lifecycle'
#>   WORD               FOUND IN
#> backtrace          last_warnings.Rd:14,22,23
#>                    NEWS.md:13
#> backtraces         NEWS.md:13
#>                    lifecycle.Rmd:155,198
#> behaviour          deprecate_soft.Rd:68
#>                    lifecycle.Rmd:38,40,42
#> Codecov            README.md:6
#> conjuction         lifecycle.Rmd:198
#> invokation         lifecycle.Rmd:85
#> programmatically   deprecate_soft.Rd:38
#> questining         README.md:27
#>                    lifecycle.Rmd:23
#> rlang's            NEWS.md:30
#> signalled          lifecycle-package.Rd:16
#>                    description:8
#> signaller          NEWS.md:18,24
#> summarised         lifecycle.Rmd:32
#> testthat           deprecate_soft.Rd:43
#>                    verbosity.Rd:11
#>                    lifecycle.Rmd:83
#> ther               lifecycle.Rmd:83

Created on 2019-08-16 by the reprex package (v0.3.0)

Session info

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.0 (2019-04-26)
#>  os       Ubuntu 18.04.2 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2019-08-16                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                        
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.0)                
#>  backports     1.1.4      2019-04-10 [1] CRAN (R 3.6.0)                
#>  callr         3.3.1      2019-07-18 [1] CRAN (R 3.6.0)                
#>  cli           1.1.0      2019-03-19 [1] CRAN (R 3.6.0)                
#>  commonmark    1.7        2018-12-01 [1] CRAN (R 3.6.0)                
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.6.0)                
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.6.0)                
#>  devtools      2.1.0      2019-07-06 [1] CRAN (R 3.6.0)                
#>  digest        0.6.20     2019-07-04 [1] CRAN (R 3.6.0)                
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 3.6.0)                
#>  fs            1.3.1      2019-05-06 [1] CRAN (R 3.6.0)                
#>  git2r         0.26.1     2019-06-29 [1] CRAN (R 3.6.0)                
#>  glue          1.3.1      2019-03-12 [1] CRAN (R 3.6.0)                
#>  highr         0.8        2019-03-20 [1] CRAN (R 3.6.0)                
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.6.0)                
#>  hunspell      3.0        2018-12-15 [1] CRAN (R 3.6.0)                
#>  knitr         1.24       2019-08-08 [1] CRAN (R 3.6.0)                
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.6.0)                
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.6.0)                
#>  pkgbuild      1.0.4      2019-08-05 [1] CRAN (R 3.6.0)                
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.6.0)                
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.6.0)                
#>  processx      3.4.1      2019-07-18 [1] CRAN (R 3.6.0)                
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.6.0)                
#>  R6            2.4.0      2019-02-14 [1] CRAN (R 3.6.0)                
#>  Rcpp          1.0.2      2019-07-25 [1] CRAN (R 3.6.0)                
#>  remotes       2.1.0      2019-06-24 [1] CRAN (R 3.6.0)                
#>  rlang         0.4.0      2019-06-25 [1] CRAN (R 3.6.0)                
#>  rmarkdown     1.14       2019-07-12 [1] CRAN (R 3.6.0)                
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.6.0)                
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.0)                
#>  spelling      2.1        2019-03-11 [1] CRAN (R 3.6.0)                
#>  stringi       1.4.3      2019-03-12 [1] CRAN (R 3.6.0)                
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.6.0)                
#>  testthat      2.2.1      2019-07-25 [1] CRAN (R 3.6.0)                
#>  usethis       1.5.1.9000 2019-08-11 [1] Github (r-lib/usethis@b241420)
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.6.0)                
#>  xfun          0.8        2019-06-25 [1] CRAN (R 3.6.0)                
#>  xml2          1.2.2      2019-08-09 [1] CRAN (R 3.6.0)                
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.6.0)                
#> 
#> [1] /home/landau/R/R-3.6.0/library

Documentation of how to add a new dictionary

How can I download a new dictionary to use another language?

I'm happy to help writing the documentation for this (or even a helper function?).

Add support for using multiple dictionaries / languages

There is a growing interest in multilingual R packages, and this relates to recent rOpenSci work.

Would it be possible to add the ability to spellcheck a package with multiple languages? To address, e.g., the case when documentation or a README is provided in two languages (real-life example). I understand that this might then miss some typos because they are a valid word in the other languages.

From a quick glance, it looks like this might require a change to hunspell (this change? ropensci/hunspell#37), so feel free to move this there if you believe that's more appropriate.

parse_text_md not exported

A user has reported that parse_text_md() is documented but not exported.

Specify additional arbitrary package files to check

It would be great if there was a way to add a vector of additional (package root relative) file paths as a parameter to spell_check_package() and to spell_check_test(). It seems like this one general change would avoid a bunch of specific requests. It would also support rare use cases, including some specific issues I have :)

Support alphanumeric and hyphenated words

I am using the following words in my package:

RNA-seq
1st
2nd
EIF4G1

After inserting these words in inst/WORDLIST and running spelling::spell_check_package(), the function reports that the words seq, st, nd and EIF are misspelled.

Currently, my WORDLIST includes the words seq, st, nd and EIF to avoid triggering the spell checker, but I would prefer to include the full words. Thanks.
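
The reported fragments are consistent with the text being split into letters-only tokens before the WORDLIST is consulted. The base R sketch below illustrates that tokenisation effect only; it is not the parser spelling or hunspell actually uses, and the sentence is made up from the words in this report.

```r
text <- "RNA-seq of EIF4G1 in the 1st and 2nd samples"

# Letters-only tokenisation yields the fragments reported above.
frag <- regmatches(text, gregexpr("[[:alpha:]]+", text))[[1]]

# Keeping digits and inner hyphens yields the full terms instead.
full <- regmatches(text, gregexpr("[[:alnum:]]+(-[[:alnum:]]+)*", text))[[1]]

frag
full
```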

Declared encoding is not used in a package?

See current pkgdepends for an example:

❯ LANG=C R -q -e 'spelling::spell_check_package()'
> spelling::spell_check_package()
DESCRIPTION does not contain 'Language' field. Defaulting to 'en-US'.
  WORD   FOUND IN
NANA   pkgdepends-package.Rd:218

Enhancement request: spell_check_files()

It would be great if the output could be parameterized so that, as well as alphabetical order, there was a 'by line number' option.
In my case, at least, this makes more sense when working through a document.

Error in read_xml.raw: Input is not proper UTF-8, indicate encoding !

Hi,

Running spelling::spell_check_test() fails on the crosstable package with the following error:

spelling::spell_check_package()
#>Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
#>  Input is not proper UTF-8, indicate encoding !
#>Bytes: 0x93 0x63 0x79 0x94 [9]

I have no clue where this error can come from and the error message is unfortunately not very informative.

Would it be possible to terminate early from spelling instead of xml2 so that the path is in the error message?
Of course, if we can also have the line and the specific bad character, it would be even better!

Note that in this case, UTF-8 is the default encoding in the package's DESCRIPTION and in the RStudio settings. R CMD check completes without error, so I guess the encoding problem is not that severe, don't you think?

REPREX

Download this file and open it in RStudio. https://raw.githubusercontent.com/DanChaltiel/crosstable/dd561f3ef405f6621357912c53ab53a6299b99cd/README.md
There are non-UTF8 characters on rows 147 and 150
Try to spell_check() (I used devtools::spell_check())

EDIT

After more debugging, it seems to pertain to this line:

spelling/R/parse-markdown.R

Line 24 in 008417f

doc <- xml2::xml_ns_strip(xml2::read_xml(md))

In my case, it pointed to my README.md file, which indeed contained special characters. I have no idea how they ended up there, though, and they are far too numerous for me to correct manually (a knitting problem from README.Rmd, I guess).

EDIT2

Since this confusing problem is not that rare (#52, #58, #62), a fix might be found useful.

Here are some proposals:

simply use a tryCatch() on xml2::xml_ns_strip() so that we can add path in the error message
add a warning in the specific case of non-UTF8 characters:

  text <- readLines(path, warn = FALSE, encoding = "UTF-8")
  # validUTF8() returns FALSE for every line that is not valid UTF-8
  invalid <- !validUTF8(text)
  if (any(invalid)) {
    warning("The file ", path, " has non-UTF-8 characters on rows: ", paste(which(invalid), collapse = ", "))
  }

use this trick from xfun::read_utf8() to ignore the problem (spell_check_package() will have no error):

  opts = options(encoding = "native.enc")
  on.exit(options(opts), add = TRUE)
  text <- readLines(path, warn = FALSE, encoding = "UTF-8")

We can do the 3 at the same time. I can make a PR if needed.
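
As a rough sketch of how the first two proposals could fit together, the helpers below warn about invalid rows and rethrow any parser error with the file path prepended. Both function names are hypothetical, and parse_fun stands in for whatever parser is actually called (for example the xml2 step quoted above); nothing here is taken from the package itself.

```r
# Hypothetical helper: read a file and warn about rows with invalid UTF-8.
read_text_checked <- function(path) {
  text <- readLines(path, warn = FALSE, encoding = "UTF-8")
  invalid <- !validUTF8(text)
  if (any(invalid)) {
    warning("The file ", path, " has non-UTF-8 characters on rows: ",
            paste(which(invalid), collapse = ", "), call. = FALSE)
  }
  text
}

# Hypothetical wrapper: run a parser and add the path to any error it raises.
with_path_in_error <- function(path, parse_fun) {
  tryCatch(
    parse_fun(read_text_checked(path)),
    error = function(e) {
      stop("Failed to parse ", path, ": ", conditionMessage(e), call. = FALSE)
    }
  )
}

# Demo file: row 2 holds the bytes 0x93 0x63 0x79 0x94 from the error above.
demo_path <- tempfile(fileext = ".md")
writeBin(as.raw(c(0x6f, 0x6b, 0x0a, 0x93, 0x63, 0x79, 0x94, 0x0a)), demo_path)
demo_warning <- tryCatch(read_text_checked(demo_path),
                         warning = function(w) conditionMessage(w))
```

With this shape the warning names the file and the offending rows, and a later parser failure still reports which file triggered it.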

Spell check does not work for two or more languages

I have a package in which the documentation is written in English (functions) and a second language (vignettes). These languages are set in the Language field of the DESCRIPTION file. This is done following the 'Writing R Extensions' manual:

A ‘Language’ field can be used to indicate if the package documentation is not in English: this should be a comma-separated list of standard (not private use or grandfathered) IETF language tags as currently defined by RFC 5646 (https://tools.ietf.org/html/rfc5646, see also https://en.wikipedia.org/wiki/IETF_language_tag), i.e., use language subtags which in essence are 2-letter ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) or 3-letter ISO 639-3 (https://en.wikipedia.org/wiki/ISO_639-3) language codes.

When running devtools::spell_check(), the following error message is issued:

> devtools::spell_check()
Error: Dictionary file not found: pt_BR, EN_US.dic

I haven't found a way around this.
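
The error message suggests that the whole Language value is being treated as a single dictionary name. A minimal sketch of splitting the field into one tag per language follows. It assumes simple language-region tags such as pt-BR and an underscore-separated dictionary naming scheme such as pt_BR; split_language_field() is a hypothetical helper, and deciding which language applies to which file is left open.

```r
# Hypothetical helper: turn "pt-BR, en-US" into c("pt_BR", "en_US").
split_language_field <- function(field) {
  tags <- trimws(strsplit(field, ",", fixed = TRUE)[[1]])
  tags <- tags[nzchar(tags)]
  vapply(tags, function(tag) {
    parts <- strsplit(tag, "[-_]")[[1]]
    parts[1] <- tolower(parts[1])                      # language subtag
    if (length(parts) > 1) parts[2] <- toupper(parts[2])  # region subtag
    paste(parts, collapse = "_")
  }, character(1), USE.NAMES = FALSE)
}

langs <- split_language_field("pt-BR, en-US")
print(langs)
```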

Avoid spell check for `References` section of Roxygen2 function documentation

I have the following which triggers the spelling check errors:

#' @references
#' \itemize{
#'   \item Yan, Xin, and Xiao Gang Su. 2010. “Stratified Wilson and Newcombe Confidence Intervals for Multiple Binomial Proportions.” Statistics in Biopharmaceutical Research 2 (3): 329–35.
#' }

My workaround at the moment is to wrap the reference in backticks:

#' @references
#' \itemize{
#'   \item `Yan, Xin, and Xiao Gang Su. 2010. “Stratified Wilson and Newcombe Confidence Intervals for Multiple Binomial Proportions.” Statistics in Biopharmaceutical Research 2 (3): 329–35.`
#' }

Rmd files with LaTeX

Occasionally we have LaTeX in md/Rmd files - I know we shouldn't but sometimes it just happens ;)

I think adding a format = "latex" to

spelling/R/check-files.R

Line 104 in fc619ee

bad_words <- hunspell::hunspell(words$text, dict = dict)

should handle this case and not have any unwanted side effects.

Error in sub(dest, "", xml2::xml_text(node), fixed = TRUE) : zero-length pattern

I have a package where spelling has been running spell checks for a while with no problem.
I'm working on an update and I'm getting some strange spelling behaviour and I can't seem to find the origin of it.

It fails during builds:

> if(requireNamespace('spelling', quietly = TRUE))
+   spelling::spell_check_test(vignettes = TRUE,
+                              error = FALSE,
+                              skip_on_cran = TRUE)
Error in sub(dest, "", xml2::xml_text(node), fixed = TRUE) : 
  zero-length pattern
Calls: <Anonymous> ... lapply -> FUN -> <Anonymous> -> xml_text<-.xml_node -> sub
Execution halted

But runs just fine interactively, returning NULL.

I thought maybe there was an issue in my wordlist, so I deleted it, but same issue persists.
Lastly, I thought I'd re-initialize it in case there was some old setup issue. But even setting it up causes the error.
Tried the same with the full path but got the same error.

> spelling::spell_check_setup(".")
Error in sub(dest, "", xml2::xml_text(node), fixed = TRUE) : 
  zero-length pattern

I believe this is the code where it errors:

spelling/R/parse-markdown.R

Line 31 in 008417f

xml2::xml_set_text(node, sub(dest, '', xml2::xml_text(node), fixed = TRUE))

However, I am unsure how or why it fails, or why it suddenly started complaining when it did not before.

Suggestions welcome
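
For reference, sub() with fixed = TRUE raises this error whenever the pattern is an empty string. The sketch below guards against that case. strip_fixed() is a hypothetical helper, and whether dest is actually empty in this package (for instance because of a link without a destination) is an unconfirmed guess.

```r
# Hypothetical guard: only call sub() when the pattern is a non-empty string.
strip_fixed <- function(text, pattern) {
  if (is.na(pattern) || !nzchar(pattern)) {
    return(text)  # nothing to remove, and sub() would fail on ""
  }
  sub(pattern, "", text, fixed = TRUE)
}

# An empty fixed pattern is an error in sub().
empty_result <- tryCatch(sub("", "", "abc", fixed = TRUE),
                         error = function(e) e)

# The guarded version returns the text unchanged instead.
guarded <- strip_fixed("abc", "")
```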

"PCDATA invalid Char value" error

If I try to spell check this file, I get the following error:

> spelling:::spell_check_file_md("README.md")
Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
  PCDATA invalid Char value 27 [9]

I tried to debug this further, but couldn't manage to find the offending text. In case it helps, this is the traceback I see:

> traceback()
7: read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html, 
       options = options)
6: read_xml.character(md)
5: xml2::read_xml(md)
4: xml_find_all(x, "//namespace::*[name()='']/parent::*")
3: xml2::xml_ns_strip(xml2::read_xml(md))
2: parse_text_md(path)
1: spelling:::spell_check_file_md("README.md")

Also, here is my session information:

Session info

sessioninfo::session_info()
#> - Session info  --------------------------------------------------------------
#>  hash: women holding hands: dark skin tone, open hands: light skin tone, thermometer
#> 
#>  setting  value
#>  version  R version 4.1.1 (2021-08-10)
#>  os       Windows 10 x64 (build 19043)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United Kingdom.1252
#>  ctype    English_United Kingdom.1252
#>  tz       Europe/Berlin
#>  date     2021-10-09
#>  pandoc   2.14.2 @ C:/PROGRA~1/Pandoc/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date (UTC) lib source
#>  backports     1.2.1      2020-12-09 [1] CRAN (R 4.1.0)
#>  cli           3.0.1      2021-07-17 [1] CRAN (R 4.1.0)
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.1.1)
#>  digest        0.6.28     2021-09-23 [1] CRAN (R 4.1.1)
#>  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.1.1)
#>  fansi         0.5.0      2021-05-25 [1] CRAN (R 4.1.1)
#>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.1.1)
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.1.1)
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.1.1)
#>  highr         0.9        2021-04-16 [1] CRAN (R 4.1.1)
#>  htmltools     0.5.2      2021-08-25 [1] CRAN (R 4.1.1)
#>  knitr         1.36.3     2021-10-09 [1] Github (yihui/knitr@00469e0)
#>  lifecycle     1.0.1      2021-09-24 [1] CRAN (R 4.1.1)
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.1.1)
#>  pillar        1.6.3      2021-09-26 [1] CRAN (R 4.1.1)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.1.1)
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.1.1)
#>  R.cache       0.15.0     2021-04-30 [1] CRAN (R 4.1.1)
#>  R.methodsS3   1.8.1      2020-08-26 [1] CRAN (R 4.1.0)
#>  R.oo          1.24.0     2020-08-26 [1] CRAN (R 4.1.0)
#>  R.utils       2.11.0     2021-09-26 [1] CRAN (R 4.1.1)
#>  reprex        2.0.1      2021-08-05 [1] CRAN (R 4.1.1)
#>  rlang         0.4.11     2021-04-30 [1] CRAN (R 4.1.1)
#>  rmarkdown     2.11.3     2021-10-09 [1] Github (rstudio/rmarkdown@5a3e941)
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.1.1)
#>  sessioninfo   1.1.1.9000 2021-10-09 [1] Github (r-lib/sessioninfo@1ff2194)
#>  spelling    * 2.2        2020-10-18 [1] CRAN (R 4.1.1)
#>  stringi       1.7.5      2021-10-04 [1] CRAN (R 4.1.1)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.1.1)
#>  styler        1.6.2.9000 2021-10-08 [1] Github (r-lib/styler@7c46e20)
#>  tibble        3.1.5      2021-09-30 [1] CRAN (R 4.1.1)
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.1.1)
#>  vctrs         0.3.8      2021-04-29 [1] CRAN (R 4.1.1)
#>  withr         2.4.2      2021-04-18 [1] CRAN (R 4.1.1)
#>  xfun          0.26       2021-09-14 [1] CRAN (R 4.1.1)
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.1.0)
#> 
#>  [1] C:/Users/IndrajeetPatil/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.1/library
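
The value 27 in the error above is a decimal character code, which corresponds to the escape character (ESC) that often begins ANSI colour sequences in captured console output. A small sketch for locating such characters is shown below; find_control_chars() is a hypothetical helper, and it assumes the offending character is a control character other than tab, line feed or carriage return.

```r
# Hypothetical helper: list lines containing control characters that XML 1.0
# does not allow (everything below 0x20 except tab, LF and CR).
find_control_chars <- function(path) {
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  hits <- grepl("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F]", lines,
                perl = TRUE, useBytes = TRUE)
  data.frame(line = which(hits), text = lines[hits], stringsAsFactors = FALSE)
}

# Demo file: line 2 contains ESC-based colour codes.
demo_file <- tempfile(fileext = ".md")
writeLines(c("ok", "\033[31mred\033[39m", "fine"), demo_file)
found <- find_control_chars(demo_file)
print(found$line)
```

Running the helper on the README.md in question should point to the lines that need cleaning, if the assumption about control characters holds.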

Sort WORDLIST in a locale-independent way

Currently, the word order in WORDLIST is locale-dependent, which can create large spurious diffs when multiple people contribute to the package but use different locales.

I see two solutions:

use method = "radix" in sort(). To my knowledge, it is the only locale-independent sorting method
temporarily set a collating locale:

orig_locale <- Sys.getlocale("LC_COLLATE")
on.exit(Sys.setlocale("LC_COLLATE", orig_locale))
Sys.setlocale("LC_COLLATE", "C")

The nice thing about the second option is that you can set the locale to the one specified in DESCRIPTION.

Please let me know if you'd like me to submit a PR for this.
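
A minimal sketch of the first option, using a made-up word vector: with method = "radix", sort() orders character vectors as in the C locale, so uppercase letters sort before lowercase ones regardless of the user's locale.

```r
# Made-up example words, not taken from any real WORDLIST.
words <- c("banana", "Apple", "apple", "Zebra")

# Radix sorting of character vectors uses C-locale ordering.
sorted <- sort(words, method = "radix")
print(sorted)
```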