nacnudus / tidyxl Goto Github PK

View Code? Open in Web Editor NEW

246.0 12.0 20.0 9 MB

Read untidy Excel files in R https://nacnudus.github.io/tidyxl/

Home Page: https://nacnudus.github.io/tidyxl/

License: Other

R 25.59% C++ 74.36% C 0.05%

r spreadsheet excelreader tidy rcpp

tidyxl's Introduction

tidyxl

tidyxl imports non-tabular data from Excel files into R. It exposes cell content, position, formatting and comments in a tidy structure for further manipulation, especially by the unpivotr package. It supports the xml-based file formats ‘.xlsx’ and ‘.xlsm’ via the embedded RapidXML C++ library. It does not support the binary file formats ‘.xlsb’ or ‘.xls’.

It also provides a function xlex() for tokenizing formulas. See the vignette for details. It is useful for detecting ‘spreadsheet smells’ (poor practice such as embedding constants in formulas, or using deep levels of nesting), and for understanding the dependency structures within spreadsheets.

Make cells tidy

Tidyxl represents each cell as one row in a dataframe. Like this.

What can you do with tidy cells? The best places to start are:

Spreadsheet Munging Strategies, a free, online cookbook using tidyxl and unpivotr
Screencasts on YouTube.
Worked examples on GitHub.

Otherwise see examples below.

Breaking changes

The version 1.0.0 release has some breaking changes. See NEWS.md for details. The previous version can be installed as follows.

devtools::install_version("tidyxl", version = "0.2.3", repos = "http://cran.us.r-project.org")

Installation

devtools::install_github("nacnudus/tidyxl")

Examples

The package includes a spreadsheet, ‘titanic.xlsx’, which contains the following pivot table:

ftable(Titanic, row.vars = 1:2)
#>              Age      Child     Adult
#>              Survived    No Yes    No Yes
#> Class Sex
#> 1st   Male                0   5   118  57
#>       Female              0   1     4 140
#> 2nd   Male                0  11   154  14
#>       Female              0  13    13  80
#> 3rd   Male               35  13   387  75
#>       Female             17  14    89  76
#> Crew  Male                0   0   670 192
#>       Female              0   0     3  20

The multi-row column headers make this difficult to import. A popular package for importing spreadsheets coerces the pivot table into a dataframe. It treats the second header row as though it were observations.

titanic <- system.file("extdata/titanic.xlsx", package = "tidyxl")
readxl::read_excel(titanic)
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> • `` -> `...5`
#> • `` -> `...7`
#> # A tibble: 10 × 7
#>    ...1  ...2   Age      Child ...5  Adult ...7
#>    <chr> <chr>  <chr>    <chr> <chr> <chr> <chr>
#>  1 <NA>  <NA>   Survived No    Yes   No    Yes
#>  2 Class Sex    <NA>     <NA>  <NA>  <NA>  <NA>
#>  3 1st   Male   <NA>     0     5     118   57
#>  4 <NA>  Female <NA>     0     1     4     140
#>  5 2nd   Male   <NA>     0     11    154   14
#>  6 <NA>  Female <NA>     0     13    13    80
#>  7 3rd   Male   <NA>     35    13    387   75
#>  8 <NA>  Female <NA>     17    14    89    76
#>  9 Crew  Male   <NA>     0     0     670   192
#> 10 <NA>  Female <NA>     0     0     3     20

tidyxl doesn’t coerce the pivot table into a data frame. Instead, it represents each cell in its own row, where it describes the cell’s address, value and other properties.

library(tidyxl)
x <- xlsx_cells(titanic)
dplyr::glimpse(x)
#> Rows: 60
#> Columns: 24
#> $ sheet               <chr> "Sheet1", "Sheet1", "Sheet1", "Sheet1", "Sheet1", …
#> $ address             <chr> "C1", "D1", "E1", "F1", "G1", "C2", "D2", "E2", "F…
#> $ row                 <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4,…
#> $ col                 <int> 3, 4, 5, 6, 7, 3, 4, 5, 6, 7, 1, 2, 1, 2, 4, 5, 6,…
#> $ is_blank            <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FAL…
#> $ content             <chr> "0", "1", NA, "2", NA, "3", "4", "5", "4", "5", "6…
#> $ data_type           <chr> "character", "character", "blank", "character", "b…
#> $ error               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ logical             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ numeric             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ date                <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ character           <chr> "Age", "Child", NA, "Adult", NA, "Survived", "No",…
#> $ character_formatted <list> [<tbl_df[1 x 14]>], [<tbl_df[1 x 14]>], <NULL>, […
#> $ formula             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ is_array            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
#> $ formula_ref         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ formula_group       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ comment             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ height              <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15…
#> $ width               <dbl> 8.38, 8.38, 8.38, 8.38, 8.38, 8.38, 8.38, 8.38, 8.…
#> $ row_outline_level   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ col_outline_level   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ style_format        <chr> "Normal", "Normal", "Normal", "Normal", "Normal", …
#> $ local_format_id     <int> 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 1, 1, 1,…

In this structure, the cells can be found by filtering.

x[x$data_type == "character", c("address", "character")]
#> # A tibble: 22 × 2
#>    address character
#>    <chr>   <chr>
#>  1 C1      Age
#>  2 D1      Child
#>  3 F1      Adult
#>  4 C2      Survived
#>  5 D2      No
#>  6 E2      Yes
#>  7 F2      No
#>  8 G2      Yes
#>  9 A3      Class
#> 10 B3      Sex
#> # … with 12 more rows
x[x$row == 4, c("address", "character", "numeric")]
#> # A tibble: 6 × 3
#>   address character numeric
#>   <chr>   <chr>       <dbl>
#> 1 A4      1st            NA
#> 2 B4      Male           NA
#> 3 D4      <NA>            0
#> 4 E4      <NA>            5
#> 5 F4      <NA>          118
#> 6 G4      <NA>           57

Specific sheets can be requested using xlsx_cells(file, sheet), and the names of all sheets in a file are given by xlsx_sheet_names().

Formatting

The original spreadsheet has formatting applied to the cells. This can also be retrieved using tidyxl, with the xlsx_formats() function.

Formatting is available by using the columns local_format_id and style_format as indexes into a separate list-of-lists structure. ‘Local’ formatting is the most common kind, applied to individual cells. ‘Style’ formatting is usually applied to blocks of cells, and defines several formats at once. Here is a screenshot of the styles buttons in Excel.

Formatting can be looked up as follows.

# Bold
formats <- xlsx_formats(titanic)
formats$local$font$bold
#> [1] FALSE  TRUE FALSE FALSE
x[x$local_format_id %in% which(formats$local$font$bold),
  c("address", "character")]
#> # A tibble: 4 × 2
#>   address character
#>   <chr>   <chr>
#> 1 C1      Age
#> 2 C2      Survived
#> 3 A3      Class
#> 4 B3      Sex

# Yellow fill
formats$local$fill$patternFill$fgColor$rgb
#> [1] NA         NA         NA         "FFFFFF00"
x[x$local_format_id %in%
  which(formats$local$fill$patternFill$fgColor$rgb == "FFFFFF00"),
  c("address", "numeric")]
#> # A tibble: 2 × 2
#>   address numeric
#>   <chr>     <dbl>
#> 1 F11           3
#> 2 G11          20

# Styles by name
formats$style$font$name["Normal"]
#>    Normal
#> "Calibri"
head(x[x$style_format == "Normal", c("address", "character")])
#> # A tibble: 6 × 2
#>   address character
#>   <chr>   <chr>
#> 1 C1      Age
#> 2 D1      Child
#> 3 E1      <NA>
#> 4 F1      Adult
#> 5 G1      <NA>
#> 6 C2      Survived

# In-cell formatting is available in the `character_formatted` column as a data
# frame, one row per substring.
examples <- system.file("/extdata/examples.xlsx", package = "tidyxl")
xlsx_cells(examples)$character_formatted[77]
#> [[1]]
#> # A tibble: 16 × 14
#>    character   bold  italic under…¹ strike vertA…²  size color…³ color…⁴ color…⁵
#>    <chr>       <lgl> <lgl>  <chr>   <lgl>  <chr>   <dbl> <chr>     <int>   <int>
#>  1 "in-cell"   FALSE FALSE  <NA>    FALSE  <NA>        0 <NA>         NA      NA
#>  2 "bold"      TRUE  FALSE  <NA>    FALSE  <NA>        0 FF0000…       2      NA
#>  3 "italic"    FALSE TRUE   <NA>    FALSE  <NA>        0 FF0000…       2      NA
#>  4 "underline" FALSE FALSE  single  FALSE  <NA>        0 FF0000…       2      NA
#>  5 "underline… FALSE FALSE  double  FALSE  <NA>        0 FF0000…       2      NA
#>  6 "singleacc… FALSE FALSE  single… FALSE  <NA>        0 FF0000…       2      NA
#>  7 "doubleacc… FALSE FALSE  double… FALSE  <NA>        0 FF0000…       2      NA
#>  8 "strikethr… FALSE FALSE  <NA>    TRUE   <NA>        0 FF0000…       2      NA
#>  9 "subscript" FALSE FALSE  <NA>    FALSE  subscr…     0 FF0000…       2      NA
#> 10 "superscri… FALSE FALSE  <NA>    FALSE  supers…     0 FF0000…       2      NA
#> 11 "red"       FALSE FALSE  <NA>    FALSE  <NA>        0 FFFF00…      NA      NA
#> 12 "theme"     FALSE FALSE  <NA>    FALSE  <NA>        0 FFC050…       6      NA
#> 13 "tint"      FALSE FALSE  <NA>    FALSE  <NA>        0 FFC050…       6      NA
#> 14 "size"      FALSE FALSE  <NA>    FALSE  <NA>        0 FFFF00…      NA      NA
#> 15 "arial"     FALSE FALSE  <NA>    FALSE  <NA>        0 FFFF00…      NA      NA
#> 16 "UTF8Stéph… FALSE FALSE  <NA>    FALSE  <NA>        0 FFFF00…      NA      NA
#> # … with 4 more variables: color_tint <dbl>, font <chr>, family <int>,
#> #   scheme <chr>, and abbreviated variable names ¹underline, ²vertAlign,
#> #   ³color_rgb, ⁴color_theme, ⁵color_indexed

To see all the available kinds of formats, use str(formats).

Comments

Comments are available alongside cell values.

x[!is.na(x$comment), c("address", "comment")]
#> # A tibble: 1 × 2
#>   address comment
#>   <chr>   <chr>
#> 1 G11     All women in the crew worked in the victualling department.

Formulas

Formulas are available, but with a few quirks.

options(width = 120)
y <- xlsx_cells(examples, "Sheet1")
y[!is.na(y$formula),
  c("address", "formula", "is_array", "formula_ref", "formula_group",
    "error", "logical", "numeric", "date", "character")]
#> # A tibble: 32 × 10
#>    address formula             is_array formula_ref formula_group error   logical numeric date                character
#>    <chr>   <chr>               <lgl>    <chr>               <int> <chr>   <lgl>     <dbl> <dttm>              <chr>
#>  1 A1      "1/0"               FALSE    <NA>                   NA #DIV/0! NA           NA NA                  <NA>
#>  2 A14     "1=1"               FALSE    <NA>                   NA <NA>    TRUE         NA NA                  <NA>
#>  3 A15     "A4+1"              FALSE    <NA>                   NA <NA>    NA         1338 NA                  <NA>
#>  4 A16     "DATE(2017,1,18)"   FALSE    <NA>                   NA <NA>    NA           NA 2017-01-18 00:00:00 <NA>
#>  5 A17     "\"Hello, World!\"" FALSE    <NA>                   NA <NA>    NA           NA NA                  Hello, Wo…
#>  6 A19     "$A$18+1"           FALSE    <NA>                   NA <NA>    NA            2 NA                  <NA>
#>  7 B19     "A18+2"             FALSE    <NA>                   NA <NA>    NA            3 NA                  <NA>
#>  8 A20     "$A$18+1"           FALSE    A20:A21                 0 <NA>    NA            2 NA                  <NA>
#>  9 B20     "A19+2"             FALSE    B20:B21                 1 <NA>    NA            4 NA                  <NA>
#> 10 A21     "$A$18+1"           FALSE    <NA>                    0 <NA>    NA            2 NA                  <NA>
#> # … with 22 more rows

The top five cells show that the results of formulas are available as usual in the columns error, logical, numeric, date, and character.

Shared formulas

Cells A20 and A21 illustrate how formulas are normalised before being written to file, using the formula_ref and formula_group columns. When there is a group of cells whose formulas only differ by cell reference (e.g. “=A1+1”, “=A2+1”, “=A3+1”, etc.), only one formula in each group is written to the file, so tidyxl infers what the formulas in the other cells in the group must be, from their relative positions.

Array formulas

There are two kinds of array formulas: ones that compute over arrays, and ones whose output is an array (of cells).

Both kinds are distinguished in spreadsheet programs by curly braces, e.g. {SUM(A19:A21*B19:B21)}. In tidyxl, the curly braces are ommitted (as they are from the file itself), and instead the is_array column has the value TRUE.

The first kind (those that compute over arrays) is illustrated by cell A22.

The second kind (those whose value is spread across an array of cells) is illustrated by cells A23 and A24. The formula is only given in the top-left cell (A23), which is also the only cell that describes the range of cells containing the result, in the formula-ref column. The results themselves are stored in all relevant cells (A23 and A24). Unlike shared formulas, there is no formula_group to associate the cells of an array formula’s result. If you need to do identify those cells, use the cellranger package and the formula_ref column.

Formulas referring to other files

Cell A25 contains a formula that refers to another file. The [1] is an index into a table of files. The roadmap for tidyxl includes de-referencing such numbers.

Tokenizing formulas

The function xlex() separates formulas into tokens of different types, and gives their depth within a nested formula. Its name is a bad pun on ‘Excel’ and ‘lexer’. Try the online demo, or install the more experimental lexl package to run demo_lexl() locally.

It is useful for detecting spreadsheet smells, which are poor practices in spreadsheet design, such as deep nests of functions, or embedding constants in formulas.

x <- xlex("MIN(3,MAX(2,A1))")
x
#> root
#> ¦-- MIN         function
#> °-- (           fun_open
#>     ¦-- 3       number
#>     ¦-- ,       separator
#>     ¦-- MAX     function
#>     °-- (       fun_open
#>         ¦-- 2   number
#>         ¦-- ,   separator
#>         °-- A1  ref
#>     °-- )       fun_close
#> °-- )           fun_close

See the vignette for more examples and details.

Named ranges

Names are imported with xlsx_names(). Also called ‘named formulas’ and ‘defined names’, these are usually used to name particular cells or ranges, making formulas that refer to them more readable. Ones that are ranges are identifed by the is_range column (using is_range()), making it easier to match the names to the cells returned by xlsx_cells() – e.g. by using the cellranger package.

When the scope of the name is within a particular sheet, rather than global, the sheet name is given.

xlsx_names(examples)
#>   rId    sheet                 name                     formula    comment hidden is_range
#> 1   1   Sheet1  named_local_formula MAX(Sheet1!$A$129:$A$130)+1       <NA>  FALSE    FALSE
#> 2   4 E09904.2   sheet_beyond_chart E09904.2!$A$1,E09904.2!$C$1       <NA>  FALSE     TRUE
#> 3  NA     <NA>         intersection   Sheet1!$B:$B Sheet1!$8:$8       <NA>  FALSE     TRUE
#> 4  NA     <NA> named_global_formula             Sheet1!$A$129-1       <NA>  FALSE    FALSE
#> 5  NA     <NA>          named_range               Sheet1!$A$129 My comment  FALSE     TRUE

Data validation rules

Data validation rules are imported with xlsx_validation(). These rules control what values may be entered into a cell, and are often used to create a drop-down list in a cell. Read the vignette for details.

xlsx_validation(examples)
#> # A tibble: 15 × 14
#>    sheet  ref            type    opera…¹ formu…² formu…³ allow…⁴ show_…⁵ promp…⁶ promp…⁷ show_…⁸ error…⁹ error…˟ error…˟
#>    <chr>  <chr>          <chr>   <chr>   <chr>   <chr>   <lgl>   <lgl>   <chr>   <chr>   <lgl>   <chr>   <chr>   <chr>
#>  1 Sheet1 A106           whole   between 0       9       TRUE    TRUE    messag… messag… TRUE    error … error … stop
#>  2 Sheet1 A108           list    <NA>    $B$108  <NA>    TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    warning
#>  3 Sheet1 A110           date    between 2017-0… 2017-0… TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#>  4 Sheet1 A111           time    between 00:00:… 09:00:… TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#>  5 Sheet1 A112           textLe… between 0       9       TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#>  6 Sheet1 A114           whole   notBet… 0       9       TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#>  7 Sheet1 A115,A121:A122 whole   equal   0       <NA>    TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#>  8 Sheet1 A116           whole   notEqu… 0       <NA>    TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#>  9 Sheet1 A117           whole   greate… 0       <NA>    TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#> 10 Sheet1 A119           whole   greate… 0       <NA>    TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#> 11 Sheet1 A120           whole   lessTh… 0       <NA>    TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#> 12 Sheet1 A118           whole   lessTh… 0       <NA>    TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#> 13 Sheet1 A107           decimal notBet… 0       9       FALSE   FALSE   <NA>    <NA>    FALSE   <NA>    <NA>    stop
#> 14 Sheet1 A113           custom  <NA>    A113<=… <NA>    TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    stop
#> 15 Sheet1 A109           list    <NA>    $B$108  <NA>    TRUE    TRUE    <NA>    <NA>    TRUE    <NA>    <NA>    inform…
#> # … with abbreviated variable names ¹operator, ²formula1, ³formula2, ⁴allow_blank, ⁵show_input_message, ⁶prompt_title,
#> #   ⁷prompt_body, ⁸show_error_message, ⁹error_title, ˟error_body, ˟error_symbol

Philosophy

Information in in many spreadsheets cannot be easily imported into R. Why?

Most R packages that import spreadsheets have difficulty unless the layout of the spreadsheet conforms to a strict definition of a ‘table’, e.g.:

observations in rows
variables in columns
a single header row
all information represented by characters, whether textual, logical, or numeric

These rules are designed to eliminate ambiguity in the interpretation of the information. But most spreadsheeting software relaxes these rules in a trade of ambiguity for expression via other media:

proximity (other than headers, i.e. other than being the first value at the top of a column)
formatting (colours and borders)

Humans can usually resolve the ambiguities with contextual knowledge, but computers are limited by their ignorance. Programmers are hampered by:

their language’s expressiveness
loss of information in transfer from spreadsheet to programming library

Information is lost when software discards it in order to force the data into tabular form. Sometimes date formatting is retained, but mostly formatting is lost, and position has to be inferred again.

tidyxl addresses the programmer’s problems by not discarding information. It imports the content, position and formatting of cells, leaving it up to the user to associate the different forms of information, and to re-encode them in tabular form without loss. The unpivotr package has been developed to assist with that step.

Similar projects

tidyxl was originally derived from readxl and still contains some of the same code, hence it inherits the GPL-3 licence. readxl is intended for importing tabular data with a single row of column headers, whereas tidyxl is more general, and less magic.

The rsheets project of several R packages is in the early stages of importing spreadsheet information from Excel and Google Sheets into R, manipulating it, and potentially parsing and processing formulas and writing out to spreadsheet files. In particular, jailbreaker attempts to extract non-tabular data from spreadsheets into tabular structures automatically via some clever algorithms.

tidyxl differs from rsheets in scope (tidyxl will never import charts, for example), and implementation (tidyxl is implemented mainly in C++ and is quite fast, only a little slower than readxl). unpivotr is a package related to tidyxl that provides tools for unpivoting complex and non-tabular data layouts using I not AI (intelligence, not artificial intelligence). In this way it corresponds to jailbreaker, but with a different philosophy.

Mailing list

For bugs and/or issues, create a new issue on GitHub For other questions or comments, please subscribe to the tidyxl-devel mailing list. You must be a member to post messages, but anyone can read the archived discussions.

tidyxl's People

Contributors

Stargazers

Watchers

Forkers

rlugojr xtmgah applied-statistic-using-r cstahl12 eleakin ktaranov baifengbai skols minghao2016 bedantaguru romainfrancois da11an rnaimehaom jmcrav lextuga007 zonylaw jussi-kupari gregrs-uk

tidyxl's Issues

typo in doc

"Names are imported with xlex_names()" should be Names are imported with xlsx_names()

Account for col- and row-level formatting

ECMA p.1678.

Rows have a customFormat attribute, and if this is 1 then apply their s attribute (style index), except where individual cells override it.

Columns have a style attribute and nothing else, but it seems to work in the same way.

Separate a column into new columns based on the number of leading spaces

Feature request, relevant StackOverflow post:

https://stackoverflow.com/q/51757816/680068

Incorrect dates with numfmt h:mm AM/PM

Hello,

The xlsx file is the same as tidyverse/readxl#264

> y <- tidy_xlsx("test02.xlsx", sheet=1)
> y$data$Sheet1
  address row col content formula formula_type formula_ref formula_group
1      A1   1   1       0    <NA>         <NA>        <NA>            NA
2      A2   2   1       1    <NA>         <NA>        <NA>            NA
3      A3   3   1     1.5    <NA>         <NA>        <NA>            NA
4      A4   4   1     1.8    <NA>         <NA>        <NA>            NA
  type data_type error logical numeric                date character
1    s character  <NA>      NA      NA                <NA>         A
2    n      date  <NA>      NA      NA 1904-01-02 00:00:00      <NA>
3    n      date  <NA>      NA      NA 1904-01-02 12:00:00      <NA>
4    n      date  <NA>      NA      NA 1904-01-02 19:12:00      <NA>
  comment height width style_format local_format_id
1    <NA>     15  8.38       Normal               1
2    <NA>     15  8.38       Normal               2
3    <NA>     15  8.38       Normal               2
4    <NA>     15  8.38       Normal               2

Can you make R Makevars specify gcc? ipcc fails.

Hello,

I noticed that the Intel compiler barfs on your package when trying to link with piton.

icpc -std=gnu++11 -I"/opt/R/R-4.0.2-mkl/lib64/R/include" -DNDEBUG  -I'/home/me/R/lib/pro/4.0/Rcpp/include' -I'/home/me/R/lib/pro/4.0/piton/include' -I/usr/local/include   -fpic  -g -O2  -c xlex.cpp -o xlex.o
token_grammar.h(107): error: "seq" is ambiguous
                                     seq< plus< D >, opt< dot, plus< D > > > > {};
                                     ^
          detected during:
            instantiation of class "xltoken::decimal<D> [with D=tao::pegtl::ascii::digit]" at line 35 of "/home/me/R/lib/pro/4.0/piton/include/pegtl/internal/duseltronik.hpp"

It seems to build fine with gcc.

Change licence to MIT

Tidyxl was licenced under GPL-3 for compatibility with certain fragments of code, cribbed from readxl, that were written at RStudio by Hadley Wickham and Jenny Bryan. That code has recently be re-licensed under MIT, so with their approval I would like to re-license tidyxl too.

@hadley and @jennybc , would you permit me to re-license tidyxl with the MIT license? If so, please comment "I agree" below.

Unable to extract conditional formatting using xlsx_formats

I am unable to extract conditional formatting from an xlsx file using xlsx_formats.

The goal is to continuously rebuild a xlsx file using openxslx, but I want to maintain the highlighting from the imported xlsx file. I am able to extract the formatting from the original file and apply the highlighting to a new file via conditional formatting. This is because the original file is not conditionally formatted, but rather manually formatted. However, when I want to build a new file using the conditionally formatted file, the formatting is no longer detected using xslx_formats.

dataValidations not detected in Office 365 files

xlsx_validation() doesn't seem to pick up data validation rules when the list of allowed values is on a different sheet. This (test_validation.xlsx) test file generates zero detected data validation rules, even though one is present.

Tested syntax

tidyxl::xlsx_validation(here::here("data/test_validation.xlsx"))

and

tidyxl::xlsx_validation(here::here("data/test_validation.xlsx"), sheets = c(1, 2))

Formatting of alignment and protection is not returned

Crashes when opening a Google Docs exported spreadsheet

hadley/readxl/tests/testthat/iris-google-doc.xlsxe

Import hyperlinks

ECMA page 1641

<hyperlinks>
 <hyperlink ref="A11" r:id="rId1" tooltip="Search Page"/>
</hyperlinks>

_rels/sheet1.xml.rels

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship
    Id="rId26"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
    Target="http://comicbooks.about.com/od/spawn/p/spawn.htm"
    TargetMode="External"/>

Fix the awful colour/date decision hack

See tidyverse/readxl#388.

Encoding issue

See #9 (comment) about the accented letter. "Stéphane" in a comment or a cell gives "StÃ©phane" in R.

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=French_Belgium.1252  LC_CTYPE=French_Belgium.1252   
[3] LC_MONETARY=French_Belgium.1252 LC_NUMERIC=C                   
[5] LC_TIME=French_Belgium.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xml2_1.1.1     tidyxl_0.2.1   openxlsx_4.0.0

loaded via a namespace (and not attached):
[1] tools_3.3.1  Rcpp_0.12.9  readxl_0.1.1

Encoding issue when using cell references on Windows 10

Hi,

I'm having a problem with the encoding (german umlaute) on Windows 10 when using a cell reference.
The test file contains two sheets, where "Sheet2" has references to "Sheet1".
On Windows the conversion works for "Sheet1" but fails for "Sheet2".
This problem does not occur when using Ubuntu.

Thanks for any tip and for this great package!!
Patrick

On Windows it fails

library(tidyxl)
#> Warning: Paket 'tidyxl' wurde unter R Version 3.6.3 erstellt

s1 <- tidyxl::xlsx_cells("german_umlaute_utf8.xlsx", sheets = 'Sheet1')
s2 <- tidyxl::xlsx_cells("german_umlaute_utf8.xlsx", sheets = 'Sheet2_with_cell_ref_to_Sheet1')

s1[, c("sheet", "address", "character")]
#>    sheet address                            character
#> 1 Sheet1      B3                                    A
#> 2 Sheet1      B4             Einführung oder Änderung
#> 3 Sheet1      B5                            Kapazität
#> 4 Sheet1      B6                          Bauüberhang
#> 5 Sheet1      B7 (Zeilenumbrüche mittels Alt + Enter)
s2[, c("sheet", "address", "character")]
#>                            sheet address                             character
#> 1 Sheet2_with_cell_ref_to_Sheet1      B4            EinfÃ¼hrung oder Ã„nderung
#> 2 Sheet2_with_cell_ref_to_Sheet1      B5                            KapazitÃ¤t
#> 3 Sheet2_with_cell_ref_to_Sheet1      B6                          BauÃ¼berhang
#> 4 Sheet2_with_cell_ref_to_Sheet1      B7 (ZeilenumbrÃ¼che mittels Alt + Enter)
#> 5 Sheet2_with_cell_ref_to_Sheet1      B9                               ZuzÃ¼ge

sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Germany.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyxl_1.0.7
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.2  magrittr_1.5    tools_3.6.2     htmltools_0.5.0
#>  [5] yaml_2.2.1      Rcpp_1.0.5      stringi_1.5.3   rmarkdown_2.5  
#>  [9] highr_0.8       knitr_1.30      stringr_1.4.0   xfun_0.19      
#> [13] digest_0.6.27   rlang_0.4.8     evaluate_0.14

On Ubuntu it works

library(tidyxl)

s1 <- tidyxl::xlsx_cells("german_umlaute_utf8.xlsx", sheets = 'Sheet1')
s2 <- tidyxl::xlsx_cells("german_umlaute_utf8.xlsx", sheets = 'Sheet2_with_cell_ref_to_Sheet1')

s1[, c("sheet", "address", "character")]
#> # A tibble: 5 x 3
#>   sheet  address character                           
#>   <chr>  <chr>   <chr>                               
#> 1 Sheet1 B3      A                                   
#> 2 Sheet1 B4      Einführung oder Änderung            
#> 3 Sheet1 B5      Kapazität                           
#> 4 Sheet1 B6      Bauüberhang                         
#> 5 Sheet1 B7      (Zeilenumbrüche mittels Alt + Enter)
s2[, c("sheet", "address", "character")]
#> # A tibble: 5 x 3
#>   sheet                          address character                           
#>   <chr>                          <chr>   <chr>                               
#> 1 Sheet2_with_cell_ref_to_Sheet1 B4      Einführung oder Änderung            
#> 2 Sheet2_with_cell_ref_to_Sheet1 B5      Kapazität                           
#> 3 Sheet2_with_cell_ref_to_Sheet1 B6      Bauüberhang                         
#> 4 Sheet2_with_cell_ref_to_Sheet1 B7      (Zeilenumbrüche mittels Alt + Enter)
#> 5 Sheet2_with_cell_ref_to_Sheet1 B9      Zuzüge


sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
#>  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
#>  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyxl_1.0.7
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.5           rstudioapi_0.13      knitr_1.30          
#>  [4] magrittr_2.0.1       rlang_0.4.8          fansi_0.4.1         
#>  [7] stringr_1.4.0        styler_1.3.2         highr_0.8           
#> [10] tools_3.6.3          xfun_0.19            utf8_1.1.4          
#> [13] cli_2.2.0            htmltools_0.5.0.9002 ellipsis_0.3.1      
#> [16] assertthat_0.2.1     yaml_2.2.1           digest_0.6.27       
#> [19] tibble_3.0.4         lifecycle_0.2.0.9000 crayon_1.3.4        
#> [22] purrr_0.3.4          vctrs_0.3.5          fs_1.5.0            
#> [25] glue_1.4.2           evaluate_0.14        rmarkdown_2.5       
#> [28] reprex_0.3.0.9001    stringi_1.5.3        compiler_3.6.3      
#> [31] pillar_1.4.7         backports_1.2.0      pkgconfig_2.0.3

^{Created on 2020-11-24 by the reprex package (v0.3.0)}

Windows and Excel version

Edition	Windows 10 Pro
Version	20H2
Betriebssystembuild	19042.630

Microsoft Excel 2016 (16.0.5083.100)

Test file 'german_umlaute_utf8.xlsx'

german_umlaute_utf8.xlsx

Optionally apply merged cells across their range

E.g. if a cell with value "foo" is merged across A1:A2 then cell A2 should be created/overridden as a clone.

E and e exponent symbols are wrongly interpreted as dates

E.g. 0E1.

Reported by @da11an (thanks!) in #24 (comment) and #51

Read in openxlsx workbooks with xlsx_cells()

Hello! Thank y'all so much for this excellent package! I have been using openxlsx heavily to create a complicatedly-formatted Excel document. I can do most everything I need with data frames and openxlsx, except I would like to easily reference the formatting of a particular cell & use that information to format a different cell. To do so I would need a decent reference table, and xlsx_cells() creates an excellent one! However, xlsx_cells() cannot read in the wb (Workbook) object created by openxlsx.

Would it be possible for xlsx_cells() and/or tidyxl in general to read in openxlsx workbook objects?
They are essentially Excel document structures in R, so I would imagine it's possible, but I am still fairly new at programming and do not know how complicated that task would be.

That is a feature request -- at the same time, if anyone knows of a solution to my general problem in openxlsx, I would not mind any thoughts! (At the moment I'm trying to deconstruct your functions to create that xlsx_cells() table myself....)

Thank you!

special characters in sheet names

Hi Duncan,
I was working with some xlsx files that had sheet names in Spanish and I noticed that the names with special characters were getting garbled in the 'sheet' variable of the output data frame. I'm assuming it's an encoding issue. Is this something that could be specified as an argument during the file import?

thanks for the package!

Google sheets exports don't always have theme1.xml or certain elements in styles.xml

The proper way to handle this is not to expect any particular elements. That would require major recoding, so I'll address it when I port formats() to C++.

Custom number format misinterpreted as date?

#,##0_M_M_M;#,##0_M_M_M;_(@_M_M_M is being interpreted as date. Excel doesn't render the numbers as dates.

Sheet "Table 11" of 2015.xlsx.

tidyxl::is_date_format("#,##0_M_M_M;#,##0_M_M_M;_(@_M_M_M")
# [1] TRUE

option to get "data" only

Hello,

In general, I personally use only the field data of the output of tidy_xlsx. It would be nice to have an option to read this field only, if it allows to gain speed.

Comments on empty cells aren't returned

Return complex in-cell string formatting

When a cell contains a string, parts of which are formatted differently from other parts, the formatting is expressed within the tag of sharedStrings.xml (and might validly be expressed in within the tag in the worksheet).

For example, the first below is formatted, but in only one way, so the formatting is expressed elswhere, but the second has many different formats for different sections of the string.

<si>
  <t>mergedsubscript</t>
</si>
<si>
  <r>
    <t>in-cell</t>
  </r>
  <r>
    <rPr>
      <b/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>bold</t>
  </r>
  <r>
    <rPr>
      <i/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>italic</t>
  </r>
  <r>
    <rPr>
      <u/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>underline</t>
  </r>
  <r>
    <rPr>
      <b/>
      <i/>
      <u/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>all-three</t>
  </r>
  <r>
    <rPr>
      <sz val="11"/>
      <color rgb="FFFF0000"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>red</t>
  </r>
  <r>
    <rPr>
      <sz val="8"/>
      <color rgb="FFFF0000"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>size</t>
  </r>
  <r>
    <rPr>
      <sz val="8"/>
      <color rgb="FFFF0000"/>
      <rFont val="Arial"/>
      <family val="2"/>
    </rPr>
    <t>arial multilinecustomheight</t>
  </r>
</si>

In the second case, only the string is returned, not the formatting, unless there is also formatting applied to the whole cell.

I'd like to return the formatting somehow, but I'd need a use-case to make it a priority, and some suggestions of how to structure the information. The lazy way to return it would be in its raw form, which can be done like so:

    std::string inlineString;
    rapidxml::print(std::back_inserter(inlineString), *is, 0);

Add functions to convert ARGB to RGBA in hex and decimal

The documentation of the xlsx_cells function seem to suggest we can use rgb function to convert rgb values to hex strings but rgb function returns values in RGBA format, having the alpha channel at the end while the output of xlsx_cells seems to be ARGB, with alpha channel at the front.

Not sure if which form of output is the original intention but it calls for a change in either the documentation or functionality

Boundary formulas should return boundaries at completely blank cells

Data Validation

Hi nacnudus,

first of thank you very much for your work on this package. It bridges a gap in my workflow and so far it works very good and quite fast.

I would like to suggest to add one feature: extracting data validation rules from Excel cells

Best
Max

Custom number formats

It appears that any custom number format including an "E" is interpreted as a date. This is not always the case. For example: "0.0###E-0"

.Call() function error?

It seems an amazing project and I wanted to give it a try:

library(tidyxl)
titanic <- system.file("extdata/titanic.xlsx", package = "tidyxl")
x <- tidyxl::xlsx_cells(titanic, sheets="Sheet1")

Gives the following error.

Error in .Call("_tidyxl_xlsx_sheet_files_", PACKAGE = "tidyxl", path) : 
  "_tidyxl_xlsx_sheet_files_" not available for .Call() for package "tidyxl"

Here is my sessioninfo.

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rcpp_0.12.12         tidyxl_1.0.0         RevoUtilsMath_10.0.0

loaded via a namespace (and not attached):
 [1] assertthat_0.2.0 digest_0.6.12    dplyr_0.7.4      withr_2.0.0      cellranger_1.1.0 R6_2.2.0         magrittr_1.5    
 [8] rlang_0.1.2.9000 bindrcpp_0.2     devtools_1.13.3  RevoUtils_10.0.4 tools_3.4.0      glue_1.1.1       yaml_2.1.14     
[15] compiler_3.4.0   pkgconfig_2.0.1  memoise_1.1.0    bindr_0.1        tibble_1.3.4

In addition to that, I had to install "piton" dependency. May be worth to mention it in the example?

devtools::install_github("ironholds/piton")
devtools::install_github("nacnudus/tidyxl")

Legacy file formats

the xlsx_cells read-in function to work on xls files

Tokenize formulas more generally

This would be useful for detecting spreadsheet smells, and revealing the dependency structure of the spreadsheet.

tidy_xlsx imports empty but formatted cells

Hello,

Same as tidyverse/readxl#236

library(openxlsx)
dd <- data.frame(A=1:2, B=3:4)
wb <- createWorkbook()
addWorksheet(wb=wb, sheetName = "Sheet1")
# add a numFmt in the empty column 3
sty <- createStyle(numFmt = "yyyy/mm/dd")
addStyle(wb, 1, style = sty, rows = 2:3, cols = 3)
writeData(wb, "Sheet1", dd)
saveWorkbook(wb, "test00.xlsx", overwrite = TRUE)

And the blank but formatted cells (C2 and C3) are imported:

> dd <- tidy_xlsx("test00.xlsx", sheet=1)
> dd$data$Sheet1
  address row col content formula formula_type formula_ref formula_group
1      A1   1   1       0    <NA>         <NA>        <NA>            NA
2      B1   1   2       1    <NA>         <NA>        <NA>            NA
3      A2   2   1       1    <NA>         <NA>        <NA>            NA
4      B2   2   2       3    <NA>         <NA>        <NA>            NA
5      C2   2   3    <NA>    <NA>         <NA>        <NA>            NA
6      A3   3   1       2    <NA>         <NA>        <NA>            NA
7      B3   3   2       4    <NA>         <NA>        <NA>            NA
8      C3   3   3    <NA>    <NA>         <NA>        <NA>            NA

However it could make sense to import empty but formatted cells. But perhaps it would be nice to have an option for removing such cells when they are in an empty column ?

Hex badge

A package this good needs a hex!

tidyxl::xlsx_cells crashes when reading excel file

Hello,

Thank you for this great package. When using tidyxl, I sometimes encounter the situation where xlsx_cells crashes session. I have attached an example that crashes my R session whenever I try to read it using xlsx_cells.

test.xlsx

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936 LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0 yaml_2.1.18

Denormalise 'shared' formulas

The deal here is that Excel doesn't write every formula to disk. Formulas that differ only by the cell references, (e.g. =A1 and =A2 are basically the same, bar a row-offset), are only written once, and other cells refer to that one instance. The application is expected to parse the 'master' formula, and then reconstruct the others based on their relative position.

That's fine for spreadsheet applications, which have to parse the formula anyway for calculation, but for tidyxl, which leaves the formulas as strings, it's a pain.

There are a few open-source parsers around, which I refer to in comments below. Most are handwritten. The only one that uses a parser generator and a grammar is XLParser, which targets analysis rather than calculation. Microsoft publishes a grammar, Excel (.xlsx) extensions to the office openxml spreadsheetml file format p.24, but it is 25 pages long, and horrible.

The minimum-viable parser for tidyxl would simply separate cell references from the rest of the formula, offset them, and put the pieces back together. I've written the grammar, so just need to design a formula object to hold the pieces and handle the offsetting.

Parsers

C++ (MIT licence), which I have forked for a couple of fixes, and wrapped in an R package.
C# parser-generator grammar and my attempt to port it to C++/Rcpp using the PEGTL header-only C++ parser generator. Another parser generator I could use is Boost C++ included in the BH package and used in readr, with similar downsides to PEGTL (it backtracks after 'successful' intermediate matches have acted their side-effects, which is too late). My port now also has a much simpler parser that simply extracts cell references from the rest of a formula.
Handwritten Apache POI parser in Java.
Gnumeric parser. Gnumeric is a goldmine of well-commented C++ code:
- range_translate
- xlsx importer
  - which references a handy bug report about shared formulas in the xlsx_cell_expr_begin function.
- formula interpreter that warns Recursive shared formula. Uh-oh.
  - the shared-formula offsets are applied in getRevV7, rather than using range_translate.
Python for pretty-printing Excel formulas (could be an opportunity to use crayon?)

Multiple lines comment

Hello,

I have this comment in a xlsx file:

tidyxl only returns the first line:

> z$data$Sheet1$comment
 [1] "StÃ©phane Laurent:" NA                   NA                  
 [4] NA                   NA                   NA                  
 [7] NA                   NA                   NA                  
[10] NA                   NA                   NA

(the encoding issue can be solved with Encoding(x) <- "UTF-8")

It is encoded like this in the xml file:

        <comment ref="A1" authorId="0">
            <text>
                <r>
                    <rPr><b/><sz val="9"/><color indexed="81"/><rFont val="Tahoma"/><family val="2"/></rPr>
                    <t>Stéphane Laurent:</t>
                </r>
                <r>
                    <rPr><sz val="9"/><color indexed="81"/><rFont val="Tahoma"/><family val="2"/></rPr>
                    <t xml:space="preserve">
                        hello</t>
                </r>
            </text>
        </comment>

It would be nice to get the concatenation of the two lines: Stéphane Laurent\r\nhello.

Specify range for xlsx_cells()

I have a large excel file that causes xlsx_cells to crash. It would be nice to say, "Only get this many rows and this many columns" when calling xlsx_cells.

Could something be put in xlsxsheet::parseSheetData?

Add Named regions to returned data

As you're already pulling the contents of the workbook.xml, how hard would it be to include the elements within as well?

That would aid people looking at formulas and also those looking sub-sections of data withing worksheets.

In a perfect world, readxl would do this, but @jennybc doesn't seem keen on the idea (readxl #79). At least a combination of this and readxl range selection would be an improvement.

Include implied sheet names in xlex reference tokens

When a formula references cells on the same sheet, the sheet name is implied, so it isn't available to the lexer. It would be convenient to put the sheet names in after lexing but before returning the output.

H/T @slavakozlov

Numeric columns rendering as dates

While looking over the results from #5, I ran into a situation where some columns with numeric contents return as dates.

This is occurring with the file example_formatting_complaint.xlsx

Reading in:

test_full <- xlsx_cells("~/Downloads/example_formatting_complaint.xlsx")

Then, displaying the offending columns:

test_full %>%
  filter(col %in% c(2,3,6)) %>%
  arrange(col) %>%
  select(address, data_type, date, character)

returns:

# A tibble: 15 x 4
   address data_type       date character
     <chr>     <chr>     <dttm>     <chr>
 1      B1 character         NA     Count
 2      B2      date 1900-01-01      <NA>
 3      B3      date 1900-01-02      <NA>
 4      B4      date 1900-01-03      <NA>
 5      B5      date 1900-01-04      <NA>
 6      C1 character         NA     Value
 7      C2      date 1900-01-02      <NA>
 8      C3      date 1900-01-05      <NA>
 9      C4      date 1900-01-08      <NA>
10      C5      date 1900-01-11      <NA>
11      F1 character         NA ideal_run
12      F2      date 1900-01-01      <NA>
13      F3      date 1900-01-02      <NA>
14      F4      date 1900-01-03      <NA>
15      F5      date 1900-01-01      <NA>

The file contents are (with the columns Count, Value, and ideal_run matching the data shown above):

ID	Count	Value	Note	ideal_product	ideal_run	ideal_category	ideal_result	ideal_altCat
A-1-TEST	1	2	Test was great	A	1	TEST	OK	TEST
A-2-PRODUCTION	2	5	Released product	A	2	PRODUCTION	OK	PRODUCTION
A-3-~~PRODUCTION~~	3	8	Flaw found after packaging; retained for testing	A	3	PRODUCTION	FAIL	failed_PRODUCTION
B-1-TEST	4	11	Test failed	B	1	TEST	OK	PRODUCTION

I am running:

Linux Ubuntu 16.04.3 LTS (xenial)
R version 3.4.1 (2017-06-30) (Single Candle)
tidyxl version tidyxl_0.2.3.9000 (reinstalled from development version today, confirmed it is current)
dplyr version dplyr_0.7.1 (only used for the filtering/output above; the date issue is present even without dplyr loaded)

Wrong formulas being read from excel (xlsx) file

I am using tidyxl version 1.0.1

packageVersion("tidyxl")
[1] ‘1.0.1’

I wanted to read the formulas out from different cells so I loaded the file

t <- tidyxl::xlsx_cells(path = paste0(file,file_name), sheets=sheetName)

and pulled out the formula using

t[grep(pattern = "^BN3114$", x = t$address,ignore.case = TRUE, value = FALSE),]$formula

which returns

"P3114*AN3105*AY3105-AC3105"

The actual excel file though has the formula as

"P3114*AN3114*AY3114-AC3114"

I noticed that this issue happens if the excel sheet cells are protected.

tidyxl::xlsx_cells crashes RStudio when reading excel file

library(tidyxl)
xlsx_cells("EvaluacionCensal_Secundaria_SEGUNDO_14112018_160622.xlsx")

# R version 3.4.4 (2018-03-15)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=Spanish_Peru.1252  LC_CTYPE=Spanish_Peru.1252    LC_MONETARY=Spanish_Peru.1252
# [4] LC_NUMERIC=C                  LC_TIME=Spanish_Peru.1252    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.4.4 tools_3.4.4    yaml_2.2.0

EvaluacionCensal_Secundaria_SEGUNDO_14112018_160622.xlsx

Datetimes inconsistent with Excel

Placeholder to track tidyverse/readxl#187

Sheet not found by its index in $data but found in $formats

The weird file: https://www.dropbox.com/s/c1abqpk1ozf6rf1/GBSIextract.xlsx?dl=0

> x <- tidy_xlsx("GBSIextract.xlsx", sheets=1)
> # there's nothing in x$data:
> names(x$data)
character(0)
> names(x$formats)
[1] "local" "style"
> length(x$formats$local$numFmt)
[1] 27

The import works if we provide the sheet name:

> x <- tidy_xlsx("GBSIextract.xlsx", sheets="GBS-Ia")
> names(x$data)
[1] "GBS-Ia"
> dim(x$data$`GBS-Ia`)
[1] 81 20

Incompatible Excel theme

Weird little Excel file linked below which will crash the rsession when attempting to open with xlsx_cells. I have removed all the original sheets, but this file still causes the crash, so the xlsx file can be used a minimal example.
minimal.xlsx

Tested on both Windows and Linux. Was able to get a backtrace of rsession using GDB on Linux:

#0  0x00007fe0ee8c7e5e in xlsxstyles::cacheThemeRgb (this=this@entry=0x7ffda6d241e8, path="/home/dan/data-lab/servest-meths/data/test_measures/minimal.xlsx") at xlsxstyles.cpp:93
#1  0x00007fe0ee8c85aa in xlsxstyles::xlsxstyles (this=0x7ffda6d241e8, path="/home/dan/data-lab/servest-meths/data/test_measures/minimal.xlsx") at xlsxstyles.cpp:30
#2  0x00007fe0ee8acf62 in xlsxbook::xlsxbook (this=0x7ffda6d24180, path=..., sheet_paths=..., sheet_names=..., comments_paths=...) at xlsxbook.cpp:32
#3  0x00007fe0ee897e2e in xlsx_cells_ (path="/home/dan/data-lab/servest-meths/data/test_measures/minimal.xlsx", sheet_paths=..., sheet_names=..., comments_paths=...) at tidyxl.cpp:26
#4  0x00007fe0ee88341c in _tidyxl_xlsx_cells_ (pathSEXP=<optimised out>, sheet_pathsSEXP=<optimised out>, sheet_namesSEXP=<optimised out>, comments_pathsSEXP=<optimised out>) at RcppExports.cpp:18
#5  0x00007fe11132cc9c in ?? () from /usr/lib/R/lib/libR.so
#6  0x00007fe11136a20d in Rf_eval () from /usr/lib/R/lib/libR.so
#7  0x00007fe11136ccae in ?? () from /usr/lib/R/lib/libR.so
#8  0x00007fe111369fec in Rf_eval () from /usr/lib/R/lib/libR.so
#9  0x00007fe11136c00f in ?? () from /usr/lib/R/lib/libR.so
#10 0x00007fe111369db6 in Rf_eval () from /usr/lib/R/lib/libR.so
#11 0x00007fe11136ccae in ?? () from /usr/lib/R/lib/libR.so
#12 0x00007fe111369fec in Rf_eval () from /usr/lib/R/lib/libR.so
#13 0x00007fe11136c00f in ?? () from /usr/lib/R/lib/libR.so
#14 0x00007fe111369db6 in Rf_eval () from /usr/lib/R/lib/libR.so
#15 0x00007fe111393632 in Rf_ReplIteration () from /usr/lib/R/lib/libR.so
#16 0x00007fe111393a31 in ?? () from /usr/lib/R/lib/libR.so
#17 0x00007fe111393ae8 in run_Rmainloop () from /usr/lib/R/lib/libR.so
#18 0x0000000000e4206f in rstudio::r::session::runEmbeddedR(rstudio::core::FilePath const&, rstudio::core::FilePath const&, bool, bool, SA_TYPE, rstudio::r::session::Callbacks const&, rstudio::r::session::InternalCallbacks*) ()
#19 0x0000000000e2061f in rstudio::r::session::run(rstudio::r::session::ROptions const&, rstudio::r::session::RCallbacks const&) ()
#20 0x0000000000718d53 in main ()

Line #0 gave me a clue that it had something to do with the theme1.xml inside the workbook, so I deleted it and repaired with Excel, which stops the crash. So it seems there's some issue with the RapidXML library and the theme settings in this weird Excel file.

cumsum needs na.rm in case boundary formula returns NA

missing cell

I am importing an irregularly structured xlsx file.

tidyxl::xlsx_cells runs without errors; however, one column has the wrong number of values. D25 is missing. Other values of column D are present and other values of row 25 are also present.

This is one of many data "forms." The other xlsx spreadsheets import correctly without any apparent problems.

Cell D25 in the problematic excel file appears to be formatted differently from the rest of the cells. Using Excel, I can "solve" the problem by copying the formatting from an adjacent good cell (i.e., D26), saving a new file, and importing. Similarly, I can copy the formatting of the bad cell (e.g., D25) and paste the formatting to a previously good cell (e.g., D24) resulting in the previously good cell (D24) now also missing (D24 and D25)!

Unfortunately, I can't easily share this file without dumping a lot of the sensitive data.

Given these observations, I attempted to recreate a MRE, to verify that formatting differences might be causing issues. I was unsuccessful recreating the problem. I tried the following:

library(openxlsx)
library(tidyxl)

## Create analagous sample dataframe

df <- head(mtcars)
df[, 4] <- c(110,110,"","","","")

## add df to "control" and "expiremental" workbooks using openxlsx

wb <- createWorkbook()
addWorksheet(wb, "mtcars")
addWorksheet(wb, "mtcars_dropped_cell")
writeData(wb, 1, df, colNames = TRUE)
writeData(wb, 2, df, colNames = TRUE)

## mimic style of worksheet

base_style <- createStyle(
  fontName = "Arial", 
  fontSize = 10, 
  fontColour = NULL,
  numFmt = "GENERAL", 
  border = NULL,
  borderColour = getOption("openxlsx.borderColour", "black"),
  borderStyle = getOption("openxlsx.borderStyle", "thin"), 
  bgFill = NULL,
  fgFill = NULL, 
  halign = "center", 
  valign = "bottom", 
  textDecoration = "bold",
  wrapText = TRUE, 
  textRotation = NULL, 
  indent = NULL
)

## apply baseline style to control and experimental worksheets

addStyle(
  wb, 1, style = base_style, 
  rows = 1:7,
  cols = 1:7,
  gridExpand = TRUE
)

addStyle(
  wb, 2, style = base_style, 
  rows = 1:7,
  cols = 1:7,
  gridExpand = TRUE
)

## mimic style of missing cell as closely as possible and apply only to cell 
## in experimental worksheet

try_style <- createStyle(
  fontName = "Calibri", 
  fontSize = 11, 
  fontColour = NULL,
  numFmt = "GENERAL", 
  border = NULL,
  borderColour = getOption("openxlsx.borderColour", "black"),
  borderStyle = getOption("openxlsx.borderStyle", "thin"), 
  bgFill = NULL,
  fgFill = NULL, 
  halign = NULL, 
  valign = , 
  textDecoration = NULL,
  wrapText = FALSE, 
  textRotation = NULL, 
  indent = NULL
)

## apply style to D5 in experimental worksheet

addStyle(
  wb,
  2,
  style = try_style,
  rows = 5,
  cols = 4
)

saveWorkbook(wb, "./mtcars_test.xlsx", overwrite = TRUE)

## import using tidyxl

tidy_base_wb <- xlsx_cells(
  path = "./mtcars_test.xlsx",
  sheets = 1, 
  check_filetype = TRUE
)

tidy_prob_wb <- xlsx_cells(
  path = "./mtcars_test.xlsx",
  sheets = 2, 
  check_filetype = TRUE
)

I'm open to suggestions on how to further troubleshoot this.

Reading csv files

Is there any change to use tidyxl for reading csv files?
I tried convert the file using rio::convert and than xlsx_cells but the conversion fails since the csv file is untidy.

Missing Row

I'm having a similar issue to that of #43 which was closed. I have highly unstructured .xlsx files which I am trying to parse and a good percentage of them have missing cells in the xlsx_cells tibble. I have attached a stripped down file that reproduces the error. In fact none of row 17 is present in the xlsx_cells tibble.
missingRow.xlsx

Crash when reading files created by Haskell xlsx library

Hello,

This file causes a total crash a R when one tries to read it with tidy_xlsx: outfile.xlsx

Possible cause: some XML tags are variant and vector, etc., instead of vt:variant, vt:vector.

I've checked that this is not due to the fact that the 0-th cellXf has not numFmt (like tidyverse/readxl#225 (comment)).

Reorganise the API

With more functionality, it makes sense to split things up into separate functions.

xlsx_cells() cells of all worksheets in one data frame (#16)
xlsx_formats() formatting and styles (#16)
xlsx_validation() data input validation rules, per sheet (#6)
xlsx_names() named formulas/ranges (#17)
xlsx_sheet_names() utility function

The xlex family:

xlex() tokenize a formula
xlex_edges() and xlex_vertices() operate on the parse tree
plot_xlex() visualise the parse tree
demo_xlex() shiny app of the above

Utility functions for common tasks:

Formatting that takes advantage of tab completion for discovery, using very long function names like fill_patternFill_fgColor_rgb(x, scope = c("style", "local")) or border_horizontal_style_indexed(x, scope =c("style", "local")), where xis a data frame withstyle_formatandlocal_format_id` in it.
A way to extract specific styles from all formats, e.g. bad <- get_style(x, "bad") where x is the return value of xlsx_formats().

nacnudus / tidyxl Goto Github PK

tidyxl's Introduction

tidyxl

Make cells tidy

Installation

Examples

Formatting

Comments

Formulas

Shared formulas

Array formulas

Formulas referring to other files

Tokenizing formulas

Named ranges

Data validation rules

Philosophy

Similar projects

Mailing list

tidyxl's People

Contributors

Stargazers

Watchers

Forkers

tidyxl's Issues

On Windows it fails

On Ubuntu it works

Windows and Excel version

Test file 'german_umlaute_utf8.xlsx'

Parsers

Recommend Projects

Recommend Topics

Recommend Org