hrbrmstr / docxtractr

:scissors: Extract Tables from Microsoft Word Documents with R

License: Other

R 100.00%

docx r rstats microsoft-word extract-tables table-extraction

docxtractr's Introduction

docxtractr

Extract Data Tables and Comments from ‘Microsoft’ ‘Word’ Documents

Description

An R package for extracting tables & comments out of Word documents (docx). Development versions are available here and production versions are on CRAN.

Microsoft Word docx files provide an XML structure that is fairly straightforward to navigate, especially when it applies to Word tables. The docxtractr package provides tools to determine table count, table structure and extract tables from Microsoft Word docx documents.

Many tables in Word documents are in twisted formats where there may be labels or other oddities mixed in that make it difficult to work with the underlying data. docxtractr provides a function—assign_colnames—that makes it easy to identify a particular row in a scraped (or any, really) data.frame as the one containing column names and have it become the column names, removing it and (optionally) all of the rows before it (since that’s usually what needs to be done).
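The idea behind assign_colnames can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: promote_row and the small raw data frame are hypothetical names made up for this example.

```r
# Hypothetical helper illustrating the idea behind assign_colnames();
# this is NOT the package's implementation.
promote_row <- function(df, row, remove_previous = TRUE) {
  # Take the chosen row's values as the new column names
  new_names <- as.character(unlist(df[row, ], use.names = FALSE))
  # Drop the chosen row and, optionally, every row before it
  drop <- if (remove_previous) seq_len(row) else row
  out <- df[-drop, , drop = FALSE]
  names(out) <- new_names
  rownames(out) <- NULL
  out
}

# A small made-up table with a label row above the real header row
raw <- data.frame(
  V1 = c("Lesson 2: Step 1", "Nigeria", "Birth rate", "Death rate"),
  V2 = c(NA, "Default", "4.78", "0.36%"),
  stringsAsFactors = FALSE
)

promote_row(raw, 2)
```

Here row 2 becomes the header, and both it and the label row above it are removed, which mirrors the "remove it and all of the rows before it" behaviour described above.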

What’s in the tin?

The following functions are implemented:

read_docx: Read in a Word document for table extraction
docx_describe_tbls: Returns a description of all the tables in the Word document
docx_describe_cmnts: Returns a description of all the comments in the Word document
docx_extract_tbl: Extract a table from a Word document
docx_extract_all_cmnts: Extract comments from a Word document
docx_extract_all_tbls: Extract all tables from a Word document (docx_extract_all is now deprecated)
docx_tbl_count: Get number of tables in a Word document
docx_cmnt_count: Get number of comments in a Word document
assign_colnames: Make a specific row the column names for the specified data.frame
mcga: Make column names great again
set_libreoffice_path: Point to Local soffice.exe File

The following data files are included:

system.file("examples/data.docx", package="docxtractr"): Word docx with 1 table
system.file("examples/data3.docx", package="docxtractr"): Word docx with 3 tables
system.file("examples/none.docx", package="docxtractr"): Word docx with 0 tables
system.file("examples/complex.docx", package="docxtractr"): Word docx with non-uniform tables
system.file("examples/comments.docx", package="docxtractr"): Word docx with comments
system.file("examples/realworld.docx", package="docxtractr"): A “real world” Word docx file with tables of all shapes and sizes
system.file("examples/trackchanges.docx", package="docxtractr"): Word docx with track changes in a table

Installation

# devtools::install_github("hrbrmstr/docxtractr")
# OR 
install.packages("docxtractr")

Usage

library(docxtractr)
library(tibble)
library(dplyr)

# current version
packageVersion("docxtractr")
#> [1] '0.6.0'

# one table
doc <- read_docx(system.file("examples/data.docx", package="docxtractr"))

docx_tbl_count(doc)
#> [1] 1

docx_describe_tbls(doc)
#> Word document [/Library/Frameworks/R.framework/Versions/3.5/Resources/library/docxtractr/examples/data.docx]
#> 
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]

docx_extract_tbl(doc, 1)
#> # A tibble: 3 x 4
#>   This  Is      A     Column  
#>   <chr> <chr>   <chr> <chr>   
#> 1 1     Cat     3.4   Dog     
#> 2 3     Fish    100.3 Bird    
#> 3 5     Pelican -99   Kangaroo

docx_extract_tbl(doc)
#> # A tibble: 3 x 4
#>   This  Is      A     Column  
#>   <chr> <chr>   <chr> <chr>   
#> 1 1     Cat     3.4   Dog     
#> 2 3     Fish    100.3 Bird    
#> 3 5     Pelican -99   Kangaroo

docx_extract_tbl(doc, header=FALSE)
#> NOTE: header=FALSE but table has a marked header row in the Word document
#> # A tibble: 4 x 4
#>   V1    V2      V3    V4      
#>   <chr> <chr>   <chr> <chr>   
#> 1 This  Is      A     Column  
#> 2 1     Cat     3.4   Dog     
#> 3 3     Fish    100.3 Bird    
#> 4 5     Pelican -99   Kangaroo

# url 

budget <- read_docx("http://rud.is/dl/1.DOCX")

docx_tbl_count(budget)
#> [1] 2

docx_describe_tbls(budget)
#> Word document [http://rud.is/dl/1.DOCX]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely

docx_extract_tbl(budget, 1)
#> # A tibble: 5 x 4
#>   ``                                 `Short-term Portfolio` `Long-term Portfolio` `Total Portfolio Values`
#>   <chr>                              <chr>                  <chr>                 <chr>                   
#> 1 Portfolio Balance (Market Value) * $  123,651,911         $ 294,704,136         $ 418,356,047           
#> 2 Effective Yield                    0.16 %                 1.42 %                1.05 %                  
#> 3 Avg. Weighted Maturity             11 Days                2.4 Years             1.7 Years               
#> 4 Net Earnings                       $      18,470          $      350,554        $      369,024          
#> 5 Benchmark**                        0.02 %                 0.41 %                0.27 %

docx_extract_tbl(budget, 2) 
#> # A tibble: 3 x 7
#>   ``                   `Amount of Funds … Maturity  `Effective Yiel… `Interpolated Y… `Total Return  … `Total Return  …
#>   <chr>                <chr>              <chr>     <chr>            <chr>            <chr>            <chr>           
#> 1 Short-Term Portfolio $ 123,651,911      11 days   0.16 %           0.01 %           0.013            0.160           
#> 2 Long-Term Portfolio  $ 294,704,136      2.4 years 1.42 %           0.41 %           0.437            0.250           
#> 3 Total Portfolio      $ 418,356,047      1.7 years 1.05 %           0.27 %           0.298            0.222

# three tables
doc3 <- read_docx(system.file("examples/data3.docx", package="docxtractr"))

docx_tbl_count(doc3)
#> [1] 3

docx_describe_tbls(doc3)
#> Word document [/Library/Frameworks/R.framework/Versions/3.5/Resources/library/docxtractr/examples/data3.docx]
#> 
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]
#> 
#> Table 2
#>   total cells: 12
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 3
#>   total cells: 14
#>   row count  : 7
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar]

docx_extract_tbl(doc3, 3)
#> # A tibble: 6 x 2
#>   Foo   Bar  
#>   <chr> <chr>
#> 1 Aa    Bb   
#> 2 Dd    Ee   
#> 3 Gg    Hh   
#> 4 1     2    
#> 5 Zz    Jj   
#> 6 Tt    ii

# no tables
none <- read_docx(system.file("examples/none.docx", package="docxtractr"))

docx_tbl_count(none)
#> [1] 0

# wrapping in try since it will return an error
# use docx_tbl_count before trying to extract in scripts/production
try(docx_describe_tbls(none))
#> No tables in document
try(docx_extract_tbl(none, 2))

# 5 tables, with two in sketchy formats
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))

docx_tbl_count(complx)
#> [1] 5

docx_describe_tbls(complx)
#> Word document [/Library/Frameworks/R.framework/Versions/3.5/Resources/library/docxtractr/examples/complex.docx]
#> 
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]
#> 
#> Table 2
#>   total cells: 12
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 3
#>   total cells: 14
#>   row count  : 7
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar]
#> 
#> Table 4
#>   total cells: 11
#>   row count  : 4
#>   uniform    : unlikely => found differing cell counts (3, 2) across some rows
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 5
#>   total cells: 21
#>   row count  : 7
#>   uniform    : likely!
#>   has header : unlikely

docx_extract_tbl(complx, 3, header=TRUE)
#> # A tibble: 6 x 2
#>   Foo   Bar  
#>   <chr> <chr>
#> 1 Aa    Bb   
#> 2 Dd    Ee   
#> 3 Gg    Hh   
#> 4 1     2    
#> 5 Zz    Jj   
#> 6 Tt    ii

docx_extract_tbl(complx, 4, header=TRUE)
#> # A tibble: 3 x 3
#>   Foo   Bar   Baz  
#>   <chr> <chr> <chr>
#> 1 Aa    BbCc  <NA> 
#> 2 Dd    Ee    Ff   
#> 3 Gg    Hh    ii

docx_extract_tbl(complx, 5, header=TRUE)
#> # A tibble: 6 x 3
#>   Foo   Bar   Baz  
#>   <chr> <chr> <chr>
#> 1 Aa    Bb    Cc   
#> 2 Dd    Ee    Ff   
#> 3 Gg    Hh    Ii   
#> 4 Jj88  Kk    Ll   
#> 5 ""    Uu    Ii   
#> 6 Hh    Ii    h

# a "real" Word doc
real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))

docx_tbl_count(real_world)
#> [1] 8

# get all the tables
tbls <- docx_extract_all_tbls(real_world)

# see table 1
tbls[[1]]
#> # A tibble: 9 x 9
#>   V1                V2        V3         V4                     V5                     V6        V7      V8     V9     
#>   <chr>             <chr>     <chr>      <chr>                  <chr>                  <chr>     <chr>   <chr>  <chr>  
#> 1 Lesson 1:  Step 1 <NA>      <NA>       <NA>                   <NA>                   <NA>      <NA>    <NA>   <NA>   
#> 2 Country           Birthrate Death Rate Population Growth 2005 Population Growth 2050 Relative… Social… Socia… Social…
#> 3 USA               2.06      0.51%      0.92%                  -0.06%                 Post- In… Female… Stabl… Good t…
#> 4 China             1.62      0.3%       0.6%                   -0.58%                 Post- In… Govern… Techn… Urbani…
#> 5 Egypt             2.83      0.41%      2.0%                   1.32%                  Mature I… Not ye… More … Slight…
#> 6 India             2.35      0.34%      1.56%                  0.76%                  Post Ind… Econom… Pover… Becomi…
#> 7 Italy             1.28      0.72%      0.35%                  -1.33%                 Late Pos… Stable… Peopl… Better…
#> 8 Mexico            2.43      0.25%      1.41%                  0.96%                  Mature I… Better… Emigr… Econom…
#> 9 Nigeria           4.78      0.26%      2.46%                  3.58%                  End of M… Disease Peopl… People…

# make table 1 better
assign_colnames(tbls[[1]], 2)
#> # A tibble: 7 x 9
#>   Country Birthrate `Death Rate` `Population Grow… `Population Grow… `Relative place… `Social Factors… `Social Factors…
#>   <chr>   <chr>     <chr>        <chr>             <chr>             <chr>            <chr>            <chr>           
#> 1 USA     2.06      0.51%        0.92%             -0.06%            Post- Industrial Female Independ… Stable Birth Ra…
#> 2 China   1.62      0.3%         0.6%              -0.58%            Post- Industrial Government inte… Technology      
#> 3 Egypt   2.83      0.41%        2.0%              1.32%             Mature Industri… Not yet industr… More children n…
#> 4 India   2.35      0.34%        1.56%             0.76%             Post Industrial  Economic growth  Poverty         
#> 5 Italy   1.28      0.72%        0.35%             -1.33%            Late Post indus… Stable birth ra… People marry la…
#> 6 Mexico  2.43      0.25%        1.41%             0.96%             Mature Industri… Better health c… Emigration      
#> 7 Nigeria 4.78      0.26%        2.46%             3.58%             End of Mechaniz… Disease          People marry ea…
#> # ... with 1 more variable: `Social Factors 3` <chr>

# make table 1's column names great again 
mcga(assign_colnames(tbls[[1]], 2))
#> # A tibble: 7 x 9
#>   country birthrate death_rate population_growt… population_growt… relative_place_in… social_factors_1 social_factors_2
#>   <chr>   <chr>     <chr>      <chr>             <chr>             <chr>              <chr>            <chr>           
#> 1 USA     2.06      0.51%      0.92%             -0.06%            Post- Industrial   Female Independ… Stable Birth Ra…
#> 2 China   1.62      0.3%       0.6%              -0.58%            Post- Industrial   Government inte… Technology      
#> 3 Egypt   2.83      0.41%      2.0%              1.32%             Mature Industrial  Not yet industr… More children n…
#> 4 India   2.35      0.34%      1.56%             0.76%             Post Industrial    Economic growth  Poverty         
#> 5 Italy   1.28      0.72%      0.35%             -1.33%            Late Post industr… Stable birth ra… People marry la…
#> 6 Mexico  2.43      0.25%      1.41%             0.96%             Mature Industrial  Better health c… Emigration      
#> 7 Nigeria 4.78      0.26%      2.46%             3.58%             End of Mechanizat… Disease          People marry ea…
#> # ... with 1 more variable: social_factors_3 <chr>

# see table 5
tbls[[5]]
#> # A tibble: 5 x 6
#>   V1                V2      V3            V4        V5        V6      
#>   <chr>             <chr>   <chr>         <chr>     <chr>     <chr>   
#> 1 Lesson 2:  Step 1 <NA>    <NA>          <NA>      <NA>      <NA>    
#> 2 Nigeria           Default Prediction    + 5 years +15 years -5 years
#> 3 Birth rate        4.78    Goes Down     4.76      4.72      4.79    
#> 4 Death rate        0.36%   Stay the Same 0.42%     0.52%     0.3%    
#> 5 Population growth 3.58%   Goes Down     3.02%     2.32%     4.38%

# make table 5 better
assign_colnames(tbls[[5]], 2)
#> # A tibble: 3 x 6
#>   Nigeria           Default Prediction    `+ 5 years` `+15 years` `-5 years`
#>   <chr>             <chr>   <chr>         <chr>       <chr>       <chr>     
#> 1 Birth rate        4.78    Goes Down     4.76        4.72        4.79      
#> 2 Death rate        0.36%   Stay the Same 0.42%       0.52%       0.3%      
#> 3 Population growth 3.58%   Goes Down     3.02%       2.32%       4.38%

# preserve lines
intracell_whitespace <- read_docx(system.file("examples/preserve.docx", package="docxtractr"))
docx_extract_all_tbls(intracell_whitespace, preserve=TRUE)
#> [[1]]
#> # A tibble: 6 x 2
#>   `Test1:` Apple                                  
#>   <chr>    <chr>                                  
#> 1 Test2:   Banana                                 
#> 2 Test3:   "Cranberry\nDark"                      
#> 3 Test4:   "Elephant, Farm\nGrandpa"              
#> 4 Test5:   "Hat\nIgloo\nJackrabbit"               
#> 5 Test6:   " \nQuestion1\n[ ] Underwear\n[ ] VM\n"
#> 6 Test7:   Warm                                   
#> 
#> [[2]]
#> # A tibble: 2 x 4
#>   ``    Kite  Lemur      Madagascar
#>   <chr> <chr> <chr>      <chr>     
#> 1 Nanny Open  Port       Quarter   
#> 2 Rain  Sand  Television Unicorn   
#> 
#> [[3]]
#> # A tibble: 2 x 2
#>   `Test8:` `Xylophone\nYew`             
#>   <chr>    <chr>                        
#> 1 Test9:   Zebra                        
#> 2 Test10:  "Apple2\nBanana2\nCranberry2"

docx_extract_all_tbls(intracell_whitespace)
#> [[1]]
#> # A tibble: 6 x 2
#>   `Test1:` Apple                                                                                        
#>   <chr>    <chr>                                                                                        
#> 1 Test2:   Banana                                                                                       
#> 2 Test3:   CranberryDark                                                                                
#> 3 Test4:   Elephant, FarmGrandpa                                                                        
#> 4 Test5:   HatIglooJackrabbit                                                                           
#> 5 Test6:   KiteLemurMadagascarNannyOpenPortQuarterRainSandTelevisionUnicorn Question1[ ] Underwear[ ] VM
#> 6 Test7:   Warm                                                                                         
#> 
#> [[2]]
#> # A tibble: 2 x 4
#>   ``    Kite  Lemur      Madagascar
#>   <chr> <chr> <chr>      <chr>     
#> 1 Nanny Open  Port       Quarter   
#> 2 Rain  Sand  Television Unicorn   
#> 
#> [[3]]
#> # A tibble: 2 x 2
#>   `Test8:` XylophoneYew           
#>   <chr>    <chr>                  
#> 1 Test9:   Zebra                  
#> 2 Test10:  Apple2Banana2Cranberry2

# comments
cmnts <- read_docx(system.file("examples/comments.docx", package="docxtractr"))

print(cmnts)
#> No tables in document
#> Word document [/Library/Frameworks/R.framework/Versions/3.5/Resources/library/docxtractr/examples/comments.docx]
#> 
#> Found 3 comments.
#> # A tibble: 1 x 2
#>   author    `# Comments`
#>   <chr>            <int>
#> 1 boB Rudis            3

glimpse(docx_extract_all_cmnts(cmnts))
#> Observations: 3
#> Variables: 5
#> $ id           <chr> "0", "1", "2"
#> $ author       <chr> "boB Rudis", "boB Rudis", "boB Rudis"
#> $ date         <chr> "2016-07-01T21:09:00Z", "2016-07-01T21:09:00Z", "2016-07-01T21:09:00Z"
#> $ initials     <chr> "bR", "bR", "bR"
#> $ comment_text <chr> "This is the first comment", "This is the second comment", "This is a reply to the second comm...

Track Changes (depends on `pandoc` being available)

# original
read_docx(
  system.file("examples/trackchanges.docx", package="docxtractr")
) %>% 
  docx_extract_all_tbls(guess_header = FALSE)
#> NOTE: header=FALSE but table has a marked header row in the Word document
#> [[1]]
#> # A tibble: 1 x 1
#>   V1   
#>   <chr>
#> 1 21

# accept
read_docx(
  system.file("examples/trackchanges.docx", package="docxtractr"),
  track_changes = "accept"
) %>% 
  docx_extract_all_tbls(guess_header = FALSE)
#> [[1]]
#> # A tibble: 1 x 1
#>   V1   
#>   <chr>
#> 1 2

# reject
read_docx(
  system.file("examples/trackchanges.docx", package="docxtractr"),
  track_changes = "reject"
) %>% 
  docx_extract_all_tbls(guess_header = FALSE)
#> [[1]]
#> # A tibble: 1 x 1
#>   V1   
#>   <chr>
#> 1 1

Test Results

library(docxtractr)
library(testthat)
#> 
#> Attaching package: 'testthat'
#> The following object is masked from 'package:dplyr':
#> 
#>     matches

date()
#> [1] "Tue Oct 23 08:10:10 2018"

test_dir("tests/")
#> ✔ | OK F W S | Context
#> ══ testthat results  ═════════════════════════════════════════════════
#> OK: 16 SKIPPED: 0 FAILED: 0
#> 
#> ══ Results ═══════════════════════════════════════════════════════════
#> Duration: 0.2 s
#> 
#> OK:       0
#> Failed:   0
#> Warnings: 0
#> Skipped:  0

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

docxtractr's Issues

Read HTTP/HTTPS Support

Fantastic package. It would be great to add the ability to read doc and docx files directly from the web, the way read_csv in readr does. If I get some time today I will look at that code and see how they do it there.

error when read_docx has url argument

Thanks for making this package available - it's working great for me when I read existing local files. However, I'm currently encountering an issue when read_docx is given a URL argument. Minimal reprex:

library(docxtractr)
#> Warning: package 'docxtractr' was built under R version 3.4.3
read_docx("http://rud.is/dl/1.DOCX")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.

It looks like the call to download.file is causing this issue

download.file("http://rud.is/dl/1.DOCX", "temp.docx")
read_docx("temp.docx")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.

To work around this I can use mode = "wb":

download.file("http://rud.is/dl/1.DOCX", "wb.docx", mode = "wb")
read_docx("wb.docx")
#> Word document [wb.docx]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely
#> No comments in document

An alternative workaround is to use the httr package:

library(httr)
#> Warning: package 'httr' was built under R version 3.4.3
r <- GET("http://rud.is/dl/1.DOCX")
bin <- content(r, "raw")
writeBin(bin, "myfile.docx")

read_docx("myfile.docx")
#> Word document [myfile.docx]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely
#> No comments in document

I thought I should raise this in case any other users have the same problem...
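Since a .docx file is a zip archive, a download that alters bytes (as a text-mode transfer on Windows can) leaves a file that unzip cannot read. The base-R sketch below shows a sanity check that could run before read_docx. looks_like_zip is a hypothetical helper, not part of docxtractr, and it only inspects the first four bytes of the file.

```r
# Hypothetical helper: TRUE if the file starts with the zip
# local-file-header signature "PK\003\004" (bytes 50 4B 03 04).
looks_like_zip <- function(path) {
  sig <- readBin(path, what = "raw", n = 4L)
  identical(sig, as.raw(c(0x50, 0x4B, 0x03, 0x04)))
}

# Demonstration with two small temporary files instead of a real download.
good <- tempfile(fileext = ".docx")
writeBin(as.raw(c(0x50, 0x4B, 0x03, 0x04, 0x14, 0x00)), good)

bad <- tempfile(fileext = ".docx")
writeBin(charToRaw("<html>not a docx</html>"), bad)

looks_like_zip(good)
looks_like_zip(bad)
```

In a script, a failed check right after download.file(url, dest, mode = "wb") would point at the download step rather than at the document itself.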

Fix the errors the recent tidyverse update introduced


Version: 0.5.0
Check: examples
Result: ERROR
    Running examples in ‘docxtractr-Ex.R’ failed
    The error most likely occurred in:
    
    > base::assign(".ptime", proc.time(), pos = "CheckExEnv")
    > ### Name: docx_extract_tbl
    > ### Title: Extract a table from a Word document
    > ### Aliases: docx_extract_tbl
    >
    > ### ** Examples
    >
    > doc3 <- read_docx(system.file("examples/data3.docx", package="docxtractr"))
    > docx_extract_tbl(doc3, 3)
    # A tibble: 6 x 2
     Foo Bar
     <chr> <chr>
    1 Aa Bb
    2 Dd Ee
    3 Gg Hh
    4 1 2
    5 Zz Jj
    6 Tt ii
    >
    > intracell_whitespace <- read_docx(system.file("examples/preserve.docx", package="docxtractr"))
    > docx_extract_tbl(intracell_whitespace, 2, preserve=FALSE)
    Error: Column 1 must be named.
    Use .name_repair to specify repair.
    Execution halted
Flavors: r-devel-linux-x86_64-debian-clang, r-devel-linux-x86_64-debian-gcc, r-patched-linux-x86_64, r-release-linux-x86_64

Version: 0.5.0
Check: examples
Result: ERROR
    Running examples in ‘docxtractr-Ex.R’ failed
    The error most likely occurred in:
    
    > ### Name: docx_extract_tbl
    > ### Title: Extract a table from a Word document
    > ### Aliases: docx_extract_tbl
    >
    > ### ** Examples
    >
    > doc3 <- read_docx(system.file("examples/data3.docx", package="docxtractr"))
    > docx_extract_tbl(doc3, 3)
    # A tibble: 6 x 2
     Foo Bar
     <chr> <chr>
    1 Aa Bb
    2 Dd Ee
    3 Gg Hh
    4 1 2
    5 Zz Jj
    6 Tt ii
    >
    > intracell_whitespace <- read_docx(system.file("examples/preserve.docx", package="docxtractr"))
    > docx_extract_tbl(intracell_whitespace, 2, preserve=FALSE)
    Error: Column 1 must be named.
    Use .name_repair to specify repair.
    Execution halted
Flavors: r-devel-linux-x86_64-fedora-clang, r-devel-linux-x86_64-fedora-gcc, r-devel-windows-ix86+x86_64, r-patched-solaris-x86, r-oldrel-windows-ix86+x86_64
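Both logs fail on table 2 of preserve.docx, whose first header cell is empty in the README output above. The error text is consistent with an empty column name being rejected when the tibble is built. Assuming that is the cause, one way to repair such names before conversion is sketched below in base R. repair_empty_names is a hypothetical helper and is not taken from the package's actual fix.

```r
# Hypothetical helper: replace empty or missing column names with
# positional placeholders (V1, V2, ...) and make the result unique.
repair_empty_names <- function(nms) {
  empty <- is.na(nms) | nms == ""
  nms[empty] <- paste0("V", which(empty))
  make.unique(nms)
}

# Header row as it appears in table 2 of preserve.docx (first cell empty)
repair_empty_names(c("", "Kite", "Lemur", "Madagascar"))
```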

Feature request: selected_text in docx_extract_all_cmnts()

It would be much nicer if the docx_extract_all_cmnts() function added a selected_text column containing the block of selected text that each comment is attached to. This would let users easily track down what each comment refers to.

Alternative way of supporting doc files

Thanks a lot for such a great package.

I was trying out docxtractr::read_docx on doc files in Windows 10 using LibreOffice Version: 6.2.5.2 (x64).

It was horribly slow (due to LibreOffice, I guess) unless LibreOffice was already open (started manually, outside R). Once I closed LibreOffice and ran the same code in R again, it was slow again.

fn <- "rough/messy_files/doc.doc"
library(tictoc)

# LibreOffice never opened since the last PC reboot
tic()
tmp <- docxtractr::read_docx(fn)
toc()
# 285.63 sec elapsed
# 4.7 min !

# LibreOffice open
tic()
tmp <- docxtractr::read_docx(fn)
toc()
# 1.1 sec elapsed

# LibreOffice closed after open
tic()
tmp <- docxtractr::read_docx(fn)
toc()
# 24.21 sec elapsed

This is OK for a single file, but definitely not for batches of files.
I was wondering whether users could be given an alternative way of supporting doc files,

such as docx4j, as mentioned in this repository. The system dependency on LibreOffice would then go away, and I believe the process would be smoother as well.

Ref #5

CRAN Submission

Do you have a timeline for submitting to CRAN? I am trying to rely on the convert_to_pdf function for PPTX in a package, so I need a docxtractr version that works, but that version is only on GitHub. So I sent #27 to try to get it rolling.

Rhub checking: https://builder.r-hub.io/status/docxtractr_0.6.2.tar.gz-2796c8bc1a0e4bf2873c7e1673ad4834

Read special symbols within the tables in a .docx file

Thank you for docxtractr. I have a .docx file with a special symbol (a tick mark) inside a table. Currently docxtractr renders it as a null character. Could support for such symbols be added to the package?

Example: how to use docx_extract_tbl() with lapply()

Fantastic new package - thank you.
In the examples please show how to wrap docx_extract_tbl() with lapply() to access all the tables in a document in one hit.
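A minimal sketch of the requested pattern, using only functions shown in the README (read_docx, docx_tbl_count, docx_extract_tbl) and the bundled data3.docx example file. It assumes docxtractr is installed.

```r
library(docxtractr)

doc3 <- read_docx(system.file("examples/data3.docx", package = "docxtractr"))

# Extract every table by index; seq_len() gives an empty sequence
# (and therefore an empty list) when the document has no tables.
tbls <- lapply(seq_len(docx_tbl_count(doc3)), function(i) {
  docx_extract_tbl(doc3, i)
})

length(tbls)
```

docx_extract_all_tbls(doc3), listed among the package's functions, returns the same kind of list in a single call.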

convert_to_pdf() fails but command-line equivalent works

Hello, I have a pptx file called slides.pptx that I want to convert to PDF using docxtractr.

When I try, I get this:

> docxtractr::convert_to_pdf("/tmp/slides.pptx")
Warning: failed to launch javaldx - java may not function correctly
The application cannot be started.
The component manager is not available.
("Cannot open uno ini file:///usr/lib/x86_64-linux-gnu/unorc at ./cppuhelper/source/defaultbootstrap.cxx:53")
Error in docxtractr::convert_to_pdf("/tmp/slides.pptx") :
  Conversion from PPTX to PDF did not succeed
In addition: Warning message:
In system(cmd, intern = TRUE) :
  running command '"/usr/bin/soffice" --convert-to pdf --headless --outdir "/tmp/RtmptPPhMp" "/tmp/RtmptPPhMp/file151a369f9d98.pptx"' had status 139

However, when I take that soffice command line at the end, and change the last parameter to /tmp/slides.pptx, it works (despite throwing a warning). It produces the output PDF in /tmp and I verified that its contents are correct:

root@4d9dd60d0e79:/app# "/usr/bin/soffice" --convert-to pdf --headless --outdir "/tmp/" "/tmp/slides.pptx"
Warning: failed to launch javaldx - java may not function correctly
convert /tmp/slides.pptx -> /tmp/slides.pdf using filter : impress_pdf_Export
root@4d9dd60d0e79:/app#

So, you might ask, why don't I just use the command line. Well, this is part of a larger software stack that relies on docxtractr, and I don't want to reinvent the wheel.

This is inside a Docker container based on Debian 12, with R-4.2.2 and docxtractr 0.6.5.

BTW, I do not have the javaldx program in the container (although I installed libreoffice via apt-get) but it does not seem to matter - despite the warning, soffice converts the pptx successfully.

So in a nutshell I am wondering why I can successfully convert pptx to pdf on the command line but not with docxtractr which apparently uses pretty much the same command line under the hood.

Thanks

Text outside of tables

I know the impetus of this package is to read data from .docx tables, but I am wondering if the xml structure would permit pulling text from beneath a specific heading. In a .docx with a common format, for example:

Introduction

Chicken ullamco meatball, magna tail elit meatloaf aliquip jerky cillum. Id chicken ut, meatloaf dolore jowl cupim porchetta aliqua tempor tenderloin sausage quis aute. Et deserunt est ground round, chicken ea do ball tip laboris tri-tip ullamco id occaecat chuck. Brisket cupim meatloaf veniam porchetta picanha meatball quis flank t-bone elit dolor rump.

Materials and Methods

Bacon ipsum dolor amet bacon dolore commodo id. Est veniam nostrud hamburger eu meatball nisi ut. Ham hock adipisicing anim aliqua ullamco. In ad cow flank meatball. Ut ham laboris incididunt pancetta do venison dolor fatback. Sint alcatra incididunt, shank sunt ground round commodo meatball tail filet mignon.

something like:
docx_extract_txt(doc, heading = "Introduction")

"Chicken ullamco meatball, magna ..."

returning a string of text. Not sure if this would be possible, but I think it could be extremely useful.
EDIT: I replaced "header" with "heading", as that seems to be more precise usage of what I'm after in MS Word parlance.

Extract contents of document footers?

This package is incredibly handy, thanks!

I don't know much about XML, but looking at an unzipped docx file, it appears that, if footers exist, each section of the document has a corresponding footer XML file (e.g., footer1.xml). Would it be possible to add a function that iterates through and extracts a document's footers (and headers, I suppose, but for the motivating use case, it's the footers I'm interested in)? It doesn't look like footers have IDs the same way comments do, but it would be great to be able to retain the section number of each footer... Is there any other information that would be useful?

Thanks in advance!
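As a starting point, the footer parts of an archive can be located from its member listing. The base-R sketch below is only an illustration: footer_parts is a hypothetical helper, not a docxtractr function, and the member names are made up for the demonstration.

```r
# Hypothetical helper: given the member names of a docx archive,
# keep only the footer parts (word/footer1.xml, word/footer2.xml, ...).
footer_parts <- function(member_names) {
  grep("^word/footer[0-9]+\\.xml$", member_names, value = TRUE)
}

# For a real file, the member names would come from the archive listing:
#   members <- unzip("report.docx", list = TRUE)$Name  # placeholder file name
members <- c("word/document.xml", "word/footer1.xml",
             "word/footer2.xml", "word/header1.xml")

footer_parts(members)
```

The file name alone does not say which section a footer belongs to. As far as I know, that link is recorded through footerReference elements in the section properties of word/document.xml, so retaining section numbers would need a second lookup there.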

Tables with track changes badly read

Hi,
Thank you for a great package. I have a document with tables, and the document has track changes enabled. When I read it, the values in the table are not read correctly. Please see the attached file and the example below. It should read "2", not "21".
docxtractr_bug.docx

Thanks for looking into it!

Tomas


> library(docxtractr)
> path<-"C:\\Users\\tomas_hovorka\\Documents\\docxtractr_bug.docx"
> 
> d1<-read_docx(path)
> t1a<-docx_extract_tbl(d1, 1)
> t1a
# A tibble: 0 x 1
# ... with 1 variable: `21` <chr>
> 
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Czech_Czech Republic.1250  LC_CTYPE=Czech_Czech Republic.1250    LC_MONETARY=Czech_Czech Republic.1250 LC_NUMERIC=C                         
[5] LC_TIME=Czech_Czech Republic.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] docxtractr_0.5.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16          utf8_1.1.4            crayon_1.3.4          dplyr_0.7.4           assertthat_0.1        R6_2.2.2              magrittr_1.5         
 [8] pillar_1.2.3          httr_1.3.1            rlang_0.2.0           rstudioapi_0.7.0-9000 bindrcpp_0.2          xml2_1.2.0            tools_3.3.1          
[15] glue_1.2.0            purrr_0.2.2           pkgconfig_2.0.1       bindr_0.1.1           tibble_1.4.2         
>

input issue

Hello, everyone,
when I used the function docx_extract_all_tbls() to extract data from a docx file that was output from SAS, I got the error "Error: Must pass in a 'docx' object".
Then I checked the result of read_docx(); the object it returned printed as follows:
rdocx document with 3063 element(s)

styles:
Normal Default Paragraph Font Normal Table No List toc 1 Hyperlink header
"paragraph" "character" "table" "numbering" "paragraph" "character" "paragraph"
页眉字符 footer 页脚字符
"character" "paragraph" "character"
Content at cursor location:
level num_id text style_name content_type
1 NA NA NA paragraph

I think the problem is in the read_docx() step, but I don't know how to solve it. Could somebody give me a hint?

Thanks

DOC: soffice failure on Plumber

In plumber, I was getting:

running command '"/usr/bin/soffice" --convert-to pdf --headless --outdir "/opt/rstudio-connect/mnt/tmp/Rtmp60Kr1p" "/opt/rstudio-connect/mnt/tmp/Rtmp60Kr1p/file4b3475b6b63f.pptx"' had status 1

So I had to add:

LD_LIBRARY_PATH = Sys.getenv("LD_LIBRARY_PATH")
Sys.setenv(
  LD_LIBRARY_PATH=
    paste0(
      "/usr/lib/libreoffice/program",
      ":",
      LD_LIBRARY_PATH))

which fixed the issue. Posting this for future users.

extract text associated with the comment

Very useful package! I really appreciate it! Thank you!

Is there a way to extract the text associated with the comments?

I unzipped the attached file test.docx and explored the unzipped files.

The word/document.xml file has the following "marks":

<w:commentRangeStart w:id="1"/>
<w:r>
<w:rPr/>
<w:t xml:space="preserve">
Five quacking zephyrs jolt my wax bed. Flummoxed by job, kvetching W. zaps Iraq. Cozy sphinx waves quart jug of bad milk. A very bad quack might jinx zippy fowls. Few quips galvanized the mock jury box. Quick brown dogs jump over the lazy fox. The jay, pig, fox, zebra, and my wolves quack! Blowzy red vixens fight for a quick jump. Joaquin Phoenix was gazed by MTV for luck.
</w:t>
</w:r>
<w:commentRangeEnd w:id="1"/>

With the following associated comments in the word/comments.xml file:

<w:comment w:id="1" w:author="Unknown Author" w:date="2018-04-05T13:58:02Z" w:initials="">
<w:p>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="Noto Sans CJK SC Regular" w:cs="FreeSans" w:ascii="Liberation Serif" w:hAnsi="Liberation Serif"/>
<w:b w:val="false"/>
<w:bCs w:val="false"/>
<w:i w:val="false"/>
<w:iCs w:val="false"/>
<w:caps w:val="false"/>
<w:smallCaps w:val="false"/>
<w:strike w:val="false"/>
<w:dstrike w:val="false"/>
<w:outline w:val="false"/>
<w:shadow w:val="false"/>
<w:emboss w:val="false"/>
<w:imprint w:val="false"/>
<w:color w:val="auto"/>
<w:spacing w:val="0"/>
<w:w w:val="100"/>
<w:position w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="24"/>
<w:u w:val="none"/>
<w:vertAlign w:val="baseline"/>
<w:em w:val="none"/>
<w:lang w:bidi="hi-IN" w:eastAsia="zh-CN" w:val="en-US"/>
</w:rPr>
<w:t>All paragraph.</w:t>
</w:r>
</w:p>
</w:comment>

These things seem linked by the w:id="1" in both word/document.xml and word/comments.xml files.
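
To illustrate the linkage, here is a rough sketch using xml2 directly. It is not how docxtractr does it; the two XML strings are cut-down stand-ins for word/document.xml and word/comments.xml, and commented_text() is a hypothetical helper name. It only handles the simple case where whole runs sit between the start and end markers.

```r
library(xml2)

# WordprocessingML namespace bound to the "w" prefix
w_ns <- "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Toy stand-in for word/document.xml
document_xml <- read_xml(sprintf(
  '<w:document xmlns:w="%s"><w:body><w:p>
     <w:r><w:t>Before. </w:t></w:r>
     <w:commentRangeStart w:id="1"/>
     <w:r><w:t>Five quacking zephyrs jolt my wax bed.</w:t></w:r>
     <w:commentRangeEnd w:id="1"/>
     <w:r><w:t> After.</w:t></w:r>
   </w:p></w:body></w:document>', w_ns))

# Toy stand-in for word/comments.xml
comments_xml <- read_xml(sprintf(
  '<w:comments xmlns:w="%s">
     <w:comment w:id="1" w:author="Unknown Author">
       <w:p><w:r><w:t>All paragraph.</w:t></w:r></w:p>
     </w:comment>
   </w:comments>', w_ns))

ns <- xml_ns(document_xml)

# Hypothetical helper: text of the runs between the start/end markers of one id
commented_text <- function(doc, id, ns) {
  xpath <- sprintf(
    "//w:r[preceding::w:commentRangeStart[@w:id='%s'] and following::w:commentRangeEnd[@w:id='%s']]/w:t",
    id, id
  )
  paste(xml_text(xml_find_all(doc, xpath, ns)), collapse = "")
}

comments <- xml_find_all(comments_xml, "//w:comment", ns)
ids <- xml_attr(comments, "w:id", ns)

# One row per comment, with the commented document text alongside it
linked <- data.frame(
  id = ids,
  comment_text = xml_text(comments, trim = TRUE),
  word_src = vapply(ids, commented_text, character(1),
                    doc = document_xml, ns = ns, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

Here linked has one row, pairing the comment "All paragraph." with the sentence it was attached to.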

It would be very useful if docx_extract_all_cmnts() returned a tibble with a column containing the text associated with each comment.

test.docx.zip

Possible to have output as tibble?

This is an awesome pkg, and I find assign_colnames useful for all kinds of data input besides docx files.

I wonder if you would consider allowing the tables output by docx_extract_tbl() to be tibbles? For a new data set coming in to my R environment I find it very handy to see the column classes that the tibble print method gives.

Could we change this line to as_tibble(dat) ?

You have tibble in Suggests already, so I don't think this will change the dependencies. Just thought I'd ask before making a PR in case you have a good reason not to do this.

docx_extract_all_cmnts(..., include_text = TRUE) failing on edge case

First off, thank you for this package, it's really useful.

I've run into an interesting scenario where the argument include_text = TRUE fails for a Word document.

Here are two near-identical Word documents:
works.docx
does not work.docx

Both just have the text "Manuscript text" with the comment "Comment text".

However, the include_text argument fails for "does not work.docx" due to the introduction of a tab character.

"does not work.docx" |> 
  docxtractr::read_docx() |> 
  docxtractr::docx_extract_all_cmnts(include_text = TRUE)
#> # A tibble: 1 x 6
#>   id    author          date                 initials comment_text word_src
#> * <chr> <chr>           <chr>                <chr>    <chr>        <chr>   
#> 1 0     James Conigrave 2022-01-18T02:08:00Z ""       Comment text ""

"works.docx" |> 
  docxtractr::read_docx() |> 
  docxtractr::docx_extract_all_cmnts(include_text = TRUE)
#> # A tibble: 1 x 6
#>   id    author          date                 initials comment_text word_src     
#> * <chr> <chr>           <chr>                <chr>    <chr>        <chr>        
#> 1 0     James Conigrave 2022-01-18T02:08:00Z ""       Comment text Manuscript t~

It appears that "does not work.docx" contains small changes to the XML that break the functionality. I'm not sure what caused them, but I would love a fix if you have time!

can't read from a local file

Hello,

Thanks for the great package. I'm having issues reading a docx file from my workspace.
For example,

doc1 <- read_docx("myfile.docx")

This simple code doesn't work. I get:
Error: 'C:\Users\smithj\AppData\Local\Temp\1\RtmpqqbQyW/docdata/word/document.xml' does not exist.

I can read in from the examples like:
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))

I don't want to copy all my files to the package example directories. Maybe I'm doing something wrong? I tried searching online but haven't had any success.

Thanks,

doc-file

I have several .doc files that each (unzipped) only contain "[Content_Types].xml" and the folders "_rels" (with ".rls") and "theme" (with "theme/theme1.xml", "theme/themeManager.xml" and "theme/_rels/themeManager.xml.rels").

Any idea how to read the old ".doc" format?
(I hope it's OK to post this as an issue. Just delete it if not^^)

Error when assigning column names if the table has only one column

Hi there, thanks for the package, very useful!

I get the following error when assigning a row as a column name if the scraped Word table has only one column.

Error in names[old] <- names(x)[j[old]] : replacement has length zero

Here is a potential solution that I was able to use:

assign_colnames_v2 = function (dat, row, remove = TRUE, remove_previous = remove) 
{
  if ((row > nrow(dat)) || (row < 1)) {
    return(dat)
  }
  
  # Save the original class of 'dat' to reassign later
  d_class <- class(dat)
  
  # Convert to data frame to ensure consistent handling
  dat <- as.data.frame(dat, stringsAsFactors = FALSE)
  
  # Check if 'dat' has only one column
  if (ncol(dat) == 1) {
    # Special handling for one-column data frame
    col_name <- as.character(dat[row, 1])
    
    # Remove the row that is now the column name, if required
    if (remove) {
      if (remove_previous) {
        # Negative indexing stays correct even when 'row' is the last row
        dat <- dat[-seq_len(row), , drop = FALSE]
      } else {
        dat <- dat[-row, , drop = FALSE]
      }
    }
    
    # Set the column name
    colnames(dat) <- col_name
  } else {
    # For data frames with more than one column, use the original approach
    colnames(dat) <- as.character(unlist(dat[row, ]))
    
    # Remove the header row (and, optionally, every row before it),
    # but only when 'remove' is TRUE, as in the one-column branch
    if (remove) {
      start <- if (remove_previous) 1 else row
      dat <- dat[-(start:row), , drop = FALSE]
    }
  }
  
  # Reset the row names
  rownames(dat) <- NULL
  
  # Reassign the original class, especially if 'dat' was a tibble
  class(dat) <- d_class
  
  # Return the modified data frame
  return(dat)
}

Hope this is useful for other people as well.

Get table heading or page number for tables

Hello,
This package has been incredibly helpful. Is there a way to include (or get) page numbers for each table? Or can we read in a particular range of pages and extract the tables from it? Alternatively, is there a way to get the line of text right above each table?

Thanks,
Mekala

Edit and upload comments to word docx

This is a great package which I use all the time to check feedback comments on student assignments.

I was wondering, however, whether there is some way to edit a comment pulled through docx_extract_all_cmnts and then write it back to the XML and save the Word document. I am not sure if the XML holds any position (line and page) information, so maybe it isn't possible. If it is, it would be a great thing to have in the package, to help fix comments without having to open the Word file and find them. Just a suggestion.

Thanks again for the work though!

Is there a way to conserve newlines from extracted tables?

I have a Word document with tables containing cells with newlines. When I extract a table, all newlines within the cells appear to be simply deleted. Is there any way to keep the formatting?
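
One thing that may be worth trying, as a sketch rather than a confirmed fix: docx_extract_tbl() takes a preserve argument which, as far as I understand it, keeps line breaks inside a cell as "\n" instead of dropping them. The example below uses the complex.docx file shipped with the package, not a document with multi-line cells, so it only shows the call.

```r
library(docxtractr)

# Example document shipped with the package
doc <- read_docx(system.file("examples/complex.docx", package = "docxtractr"))

# preserve = TRUE asks for line breaks within cells to be kept
tbl <- docx_extract_tbl(doc, tbl_number = 1, header = TRUE, preserve = TRUE)
```

With a table that has multi-paragraph cells, the affected values should then contain "\n" where the breaks were.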

Numbers are lost when reading cells with numbered lists

First of all, thanks for the amazing package!

I am trying to read the contents of a docx table and having issues with numbered lists. If I enter the numbers by hand, all is well, but if I use a numbered list, the numbers are lost when extracting the table.

As you can see in the reproducible example below, row 2 (Items) is extracted fine ("1. First item\n2. Second item"), but in row 3 (Items2), the numbers are lost ("First item\nSecond item").

DOC = docxtractr::read_docx("https://github.com/gorkang/BUG_docxtractr/blob/master/test.docx?raw=true")
TABLE = docxtractr::docx_extract_tbl(DOC, preserve = TRUE, header = FALSE)

# Not using a numbered list. All is fine
TABLE$V2[2]
#> [1] "1. First item\n2. Second item"

# In row 3 we use a numbered list. Numbers are lost
TABLE$V2[3]
#> [1] "First item\nSecond item"

Thanks!

Created on 2022-08-03 by the reprex package (v2.0.1)