
docxtractr's Issues

input issue

Hello, everyone,
when I used the function docx_extract_all_tbls() to extract data from a .docx file generated by SAS, I got the error "Error: Must pass in a 'docx' object".
Then I checked the result of read_docx(); the returned object prints as follows:
rdocx document with 3063 element(s)

  • styles:
    Normal Default Paragraph Font Normal Table No List toc 1 Hyperlink header
    "paragraph" "character" "table" "numbering" "paragraph" "character" "paragraph"
    页眉 字符 footer 页脚 字符
    "character" "paragraph" "character"

  • Content at cursor location:
    level num_id text style_name content_type
    1 NA NA NA paragraph

I think the problem is in the read_docx() step, but I don't know how to solve it. Could somebody give me a hint?

Thanks

Read HTTP/HTTPS Support

Fantastic package. It would be great to add the ability to read .doc and .docx files directly from the web, like read_csv() in readr does. If I get some time today I will look at that code and see how they do it there.
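One possible shape for this, sketched below under the assumption that downloading to a tempfile and delegating to the existing local-file reader is acceptable (the wrapper name read_docx_url is hypothetical, not an existing API); mode = "wb" matters on Windows, where the default text mode corrupts the zip container:

```r
# Hypothetical sketch, not a docxtractr API: fetch a remote .docx to a
# tempfile in binary mode, then hand it to the existing local-file reader.
read_docx_url <- function(url) {
  tmp <- tempfile(fileext = ".docx")
  utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
  on.exit(unlink(tmp), add = TRUE)
  docxtractr::read_docx(tmp)
}
```

read_docx() itself could branch on grepl("^https?://", path) and reuse the same logic internally.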

Feature request: selected_text in docx_extract_all_cmnts()

It would be nice if the docx_extract_all_cmnts() function added a selected_text column containing the block of selected text corresponding to each comment. This would allow users to easily track down what each comment refers to.

Read special symbols within the tables in a .docx file

Thank you for docxtractr. A table in the .docx file I am reading contains a special symbol (a tick mark), which docxtractr currently renders as a null character. Could support for such symbols be added to the package?

Fix the errors the recent tidyverse update introduced


Version: 0.5.0
Check: examples
Result: ERROR
    Running examples in ‘docxtractr-Ex.R’ failed
    The error most likely occurred in:
    
    > base::assign(".ptime", proc.time(), pos = "CheckExEnv")
    > ### Name: docx_extract_tbl
    > ### Title: Extract a table from a Word document
    > ### Aliases: docx_extract_tbl
    >
    > ### ** Examples
    >
    > doc3 <- read_docx(system.file("examples/data3.docx", package="docxtractr"))
    > docx_extract_tbl(doc3, 3)
    # A tibble: 6 x 2
     Foo Bar
     <chr> <chr>
    1 Aa Bb
    2 Dd Ee
    3 Gg Hh
    4 1 2
    5 Zz Jj
    6 Tt ii
    >
    > intracell_whitespace <- read_docx(system.file("examples/preserve.docx", package="docxtractr"))
    > docx_extract_tbl(intracell_whitespace, 2, preserve=FALSE)
    Error: Column 1 must be named.
    Use .name_repair to specify repair.
    Execution halted
Flavors: r-devel-linux-x86_64-debian-clang, r-devel-linux-x86_64-debian-gcc, r-patched-linux-x86_64, r-release-linux-x86_64

Version: 0.5.0
Check: examples
Result: ERROR
    Running examples in ‘docxtractr-Ex.R’ failed with the same error as above.
Flavors: r-devel-linux-x86_64-fedora-clang, r-devel-linux-x86_64-fedora-gcc, r-devel-windows-ix86+x86_64, r-patched-solaris-x86, r-oldrel-windows-ix86+x86_64

Text outside of tables

I know the impetus of this package is to read data from .docx tables, but I am wondering if the xml structure would permit pulling text from beneath a specific heading. In a .docx with a common format, for example:

Introduction

Chicken ullamco meatball, magna tail elit meatloaf aliquip jerky cillum. Id chicken ut, meatloaf dolore jowl cupim porchetta aliqua tempor tenderloin sausage quis aute. Et deserunt est ground round, chicken ea do ball tip laboris tri-tip ullamco id occaecat chuck. Brisket cupim meatloaf veniam porchetta picanha meatball quis flank t-bone elit dolor rump.

Materials and Methods

Bacon ipsum dolor amet bacon dolore commodo id. Est veniam nostrud hamburger eu meatball nisi ut. Ham hock adipisicing anim aliqua ullamco. In ad cow flank meatball. Ut ham laboris incididunt pancetta do venison dolor fatback. Sint alcatra incididunt, shank sunt ground round commodo meatball tail filet mignon.

something like:
docx_extract_txt(doc, heading = "Introduction")

"Chicken ullamco meatball, magna ..."

returning a string of text. Not sure if this would be possible, but I think it could be extremely useful.
EDIT: I replaced "header" with "heading", as that seems to be the more precise term for what I'm after in MS Word parlance.
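A rough sketch of how this could work with xml2, assuming the document uses the built-in "HeadingN" paragraph styles (the function name is illustrative, not an existing docxtractr API):

```r
library(xml2)

# Illustrative sketch: return the text of the paragraphs that follow a given
# heading in word/document.xml, stopping at the next heading.
extract_text_under_heading <- function(document_xml_path, heading) {
  doc    <- read_xml(document_xml_path)
  paras  <- xml_find_all(doc, "//w:p")
  styles <- vapply(paras, function(p) {
    s <- xml_attr(xml_find_first(p, ".//w:pStyle"), "val")
    if (is.na(s)) "" else s
  }, character(1))
  texts   <- vapply(paras, xml_text, character(1))
  is_head <- grepl("^Heading", styles)
  start   <- which(is_head & texts == heading)[1]
  if (is.na(start)) return(character(0))
  next_head <- which(is_head & seq_along(paras) > start)[1]
  end <- if (is.na(next_head)) length(paras) else next_head - 1L
  if (end <= start) return(character(0))
  texts[(start + 1L):end]   # body paragraphs under the heading
}
```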

error when read_docx has url argument

Thanks for making this package available - it's working great for me when I read existing local files. However, I'm currently encountering an issue when read_docx() is given a URL argument. Minimal reprex:

library(docxtractr)
#> Warning: package 'docxtractr' was built under R version 3.4.3
read_docx("http://rud.is/dl/1.DOCX")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.

It looks like the call to download.file() is causing the issue:

download.file("http://rud.is/dl/1.DOCX", "temp.docx")
read_docx("temp.docx")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.

To work around this, I can use mode = "wb":

download.file("http://rud.is/dl/1.DOCX", "wb.docx", mode = "wb")
read_docx("wb.docx")
#> Word document [wb.docx]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely
#> No comments in document

An alternative workaround is to use the httr package:

library(httr)
#> Warning: package 'httr' was built under R version 3.4.3
r <- GET("http://rud.is/dl/1.DOCX")
bin <- content(r, "raw")
writeBin(bin, "myfile.docx")

read_docx("myfile.docx")
#> Word document [myfile.docx]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely
#> No comments in document

I thought I should raise this in case any other users have the same problem...

Get table heading or page number for tables

Hello,
This package has been incredibly helpful. Is there a way to include (or get) the page number for each table? Or can we read in a particular range of pages and extract the tables from it? Alternatively, is there a way to get the text from the line right above each table?

Thanks,
Mekala

DOC: soffice failure on Plumber

In plumber, I was getting:

running command '"/usr/bin/soffice" --convert-to pdf --headless --outdir "/opt/rstudio-connect/mnt/tmp/Rtmp60Kr1p" "/opt/rstudio-connect/mnt/tmp/Rtmp60Kr1p/file4b3475b6b63f.pptx"' had status 1

So I had to add:

LD_LIBRARY_PATH = Sys.getenv("LD_LIBRARY_PATH")
Sys.setenv(
  LD_LIBRARY_PATH=
    paste0(
      "/usr/lib/libreoffice/program",
      ":",
      LD_LIBRARY_PATH))

which fixed the issue. Posting this here for future users.

convert_to_pdf() fails but command-line equivalent works

Hello, I have a pptx file called slides.pptx that I want to convert to PDF using docxtractr.

When I try, I get this:

> docxtractr::convert_to_pdf("/tmp/slides.pptx")
Warning: failed to launch javaldx - java may not function correctly
The application cannot be started.
The component manager is not available.
("Cannot open uno ini file:///usr/lib/x86_64-linux-gnu/unorc at ./cppuhelper/source/defaultbootstrap.cxx:53")
Error in docxtractr::convert_to_pdf("/tmp/slides.pptx") :
  Conversion from PPTX to PDF did not succeed
In addition: Warning message:
In system(cmd, intern = TRUE) :
  running command '"/usr/bin/soffice" --convert-to pdf --headless --outdir "/tmp/RtmptPPhMp" "/tmp/RtmptPPhMp/file151a369f9d98.pptx"' had status 139

However, when I take the soffice command line at the end and change the last parameter to /tmp/slides.pptx, it works (despite throwing a warning). It produces the output PDF in /tmp, and I verified that its contents are correct:

root@4d9dd60d0e79:/app# "/usr/bin/soffice" --convert-to pdf --headless --outdir "/tmp/" "/tmp/slides.pptx"
Warning: failed to launch javaldx - java may not function correctly
convert /tmp/slides.pptx -> /tmp/slides.pdf using filter : impress_pdf_Export
root@4d9dd60d0e79:/app#

So, you might ask, why don't I just use the command line? Well, this is part of a larger software stack that relies on docxtractr, and I don't want to reinvent the wheel.

This is inside a Docker container based on Debian 12, with R-4.2.2 and docxtractr 0.6.5.

BTW, I do not have the javaldx program in the container (although I installed libreoffice via apt-get) but it does not seem to matter - despite the warning, soffice converts the pptx successfully.

So, in a nutshell, I am wondering why I can successfully convert pptx to pdf on the command line but not with docxtractr, which apparently uses essentially the same command line under the hood.

Thanks

Extract contents of document footers?

This package is incredibly handy, thanks!

I don't know much about XML, but looking at an unzipped docx file, it appears that, if the footer exists, each section of document has a corresponding footer XML file (e.g., footer1.xml). Would it be possible to add a function that iterates through and extracts a document's footers (and headers, I suppose, but for the motivating use case, it's the footers I'm interested in)? It doesn't look like footers have IDs the same way comments do, but it would be great to be able to retain the section number of each footer... Is there any other information that would be useful?
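Something along these lines might work, assuming xml2 (sketch only; extract_footers is not an existing docxtractr function). The file number roughly tracks the section:

```r
library(xml2)

# Sketch: a .docx is a zip archive; footers live at word/footer1.xml,
# word/footer2.xml, ... (roughly one per document section). Return the
# plain text of each footer part, named by part.
extract_footers <- function(docx_path) {
  parts  <- utils::unzip(docx_path, list = TRUE)$Name
  footer <- grep("^word/footer[0-9]+\\.xml$", parts, value = TRUE)
  tmp <- tempfile()
  utils::unzip(docx_path, files = footer, exdir = tmp)
  vapply(footer, function(p) xml_text(read_xml(file.path(tmp, p))),
         character(1))
}
```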

Thanks in advance!

doc-file

I have several .doc files that each (unzipped) only contain "[Content_Types].xml" and the folders "_rels" (with a ".rels" file) and "theme" (with "theme/theme1.xml", "theme/themeManager.xml" and "theme/_rels/themeManager.xml.rels").

Any idea how to read the old ".doc" format?
(I hope it's OK to post this as an issue. Just delete it if not^^)

docx_extract_all_cmnts(..., include_text = TRUE) failing on edge case

First off, thank you for this package, it's really useful.

I've run into an interesting scenario where the argument include_text = TRUE fails for a word document.

Here are two near identical word documents:
works.docx
does not work.docx

Both just have the text: "Manuscript text" with the comment "comment text"

However, the include_text argument fails for "does not work.docx" due to the introduction of a tab character.

"does not work.docx" |> 
  docxtractr::read_docx() |> 
  docxtractr::docx_extract_all_cmnts(include_text = TRUE)
#> # A tibble: 1 x 6
#>   id    author          date                 initials comment_text word_src
#> * <chr> <chr>           <chr>                <chr>    <chr>        <chr>   
#> 1 0     James Conigrave 2022-01-18T02:08:00Z ""       Comment text ""
"works.docx" |> 
  docxtractr::read_docx() |> 
  docxtractr::docx_extract_all_cmnts(include_text = TRUE)
#> # A tibble: 1 x 6
#>   id    author          date                 initials comment_text word_src     
#> * <chr> <chr>           <chr>                <chr>    <chr>        <chr>        
#> 1 0     James Conigrave 2022-01-18T02:08:00Z ""       Comment text Manuscript t~

It appears that in the file "does not work" there are small changes to the XML which break the functionality. I'm not quite sure what caused them, but I would love a fix if you have time!

Edit and upload comments to word docx

This is a great package which I use all the time to check feedback comments on student assignments.

I was wondering, however, if there is some way to edit a comment pulled through docx_extract_all_cmnts() and then write it back to the XML and save the Word document out? I am not sure whether the XML holds any position (line and page) information, so maybe it isn't possible. If it does, this would be a great thing to have in the package: it would help fix comments without having to open the Word file and hunt them down. Just a suggestion.
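On the position question: the anchors (w:commentRangeStart/End) live in word/document.xml, while the comment bodies live in word/comments.xml, so editing only the latter should preserve positions. A sketch of the round trip, assuming the xml2 and zip packages (edit_comment is hypothetical, not a docxtractr feature):

```r
library(xml2)

# Sketch: rewrite one comment's text inside word/comments.xml and repack the
# .docx. The anchors in word/document.xml are untouched, so position is kept.
# Note: this only replaces the first w:t run of the comment.
edit_comment <- function(docx_in, docx_out, comment_id, new_text) {
  tmp <- tempfile(); dir.create(tmp)
  utils::unzip(docx_in, exdir = tmp)
  cpath <- file.path(tmp, "word", "comments.xml")
  cmnts <- read_xml(cpath)
  node  <- xml_find_first(
    cmnts, sprintf("//w:comment[@w:id='%s']//w:t", comment_id))
  xml_text(node) <- new_text
  write_xml(cmnts, cpath)
  # repack everything; docx_out should be an absolute path
  zip::zip(docx_out, files = list.files(tmp, recursive = TRUE), root = tmp)
  invisible(docx_out)
}
```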

Thanks again for the work though!

Alternative way of Supporting for doc-files

Thanks a lot for such a great package.

I was trying out docxtractr::read_docx() on .doc files on Windows 10, using LibreOffice Version 6.2.5.2 (x64).

It is horribly slow (due to LibreOffice, I guess) unless LibreOffice is already open (launched manually outside R). If I close it and run the same code in R again, it is slow again.

fn <- "rough/messy_files/doc.doc"
library(tictoc)

# LibreOffice never opened since last PC reboot
tic()
tmp <- docxtractr::read_docx(fn)
toc()
# 285.63 sec elapsed
# 4.7 min !

# LibreOffice open
tic()
tmp <- docxtractr::read_docx(fn)
toc()
# 1.1 sec elapsed

# LibreOffice closed after open
tic()
tmp <- docxtractr::read_docx(fn)
toc()
# 24.21 sec elapsed

That is OK for a single file, but with bundles of files it is definitely not good.
I was wondering whether an alternative way of supporting .doc files could be offered to users.

For example, docx4j, as mentioned in this repository. Then the system dependency on LibreOffice would go away, and I believe it would be smoother as well.

Ref #5

Error when assigning column names if the table has only one column

Hi there, thanks for the package, very useful!

I get the following error when assigning a row as the column names if the scraped Word table has only one column:

Error in names[old] <- names(x)[j[old]] : replacement has length zero

Here is a potential solution that I was able to use:

assign_colnames_v2 <- function(dat, row, remove = TRUE, remove_previous = remove) {

  if ((row > nrow(dat)) || (row < 1)) {
    return(dat)
  }

  # Save the original class of 'dat' to reassign later
  d_class <- class(dat)

  # Convert to data frame to ensure consistent handling
  dat <- as.data.frame(dat, stringsAsFactors = FALSE)

  if (ncol(dat) == 1) {
    # Special handling for a one-column data frame: unlist(dat[row, ]) would
    # otherwise yield a zero-length names vector
    col_name <- as.character(dat[row, 1])

    # Remove the row that became the column name (and, optionally, the rows
    # before it), if required
    if (remove) {
      start <- if (remove_previous) 1 else row
      keep  <- setdiff(seq_len(nrow(dat)), start:row)
      dat   <- dat[keep, , drop = FALSE]
    }

    # Set the column name
    colnames(dat) <- col_name
  } else {
    # For data frames with more than one column, use the original approach
    colnames(dat) <- as.character(unlist(dat[row, ]))

    # Remove the rows only when requested (the original always removed them)
    if (remove) {
      start <- if (remove_previous) 1 else row
      dat <- dat[-(start:row), , drop = FALSE]
    }
  }

  # Reset the row names
  rownames(dat) <- NULL

  # Reassign the original class (e.g. if 'dat' was a tibble)
  class(dat) <- d_class

  dat
}

Hope this is useful for other people as well.

C

Possible to have output as tibble?

This is an awesome pkg, and I find assign_colnames useful for all kinds of data input besides docx files.

I wonder if you would consider allowing the tables output by docx_extract_tbl() to be tibbles? For a new data set coming in to my R environment I find it very handy to see the column classes that the tibble print method gives.

Could we change this line to as_tibble(dat) ?

You have tibble in Suggests already, so I don't think this will change the dependencies. Just thought I'd ask before making a PR in case you have a good reason not to do this.
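In the meantime, callers can get the same effect by wrapping the result themselves; this only changes the print method, not the data:

```r
# Wrap an extracted table as a tibble to get the richer print method
# (column classes shown alongside the data).
doc <- docxtractr::read_docx(system.file("examples/data3.docx",
                                         package = "docxtractr"))
tbl <- tibble::as_tibble(docxtractr::docx_extract_tbl(doc, 3))
tbl
```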

extract text associated with the comment

Very useful package! I really appreciate it! Thank you!

Is there a way to extract the text associated with the comments?

I did unzip the attached file test.docx, and I did explore the unzipped files.

The word/document.xml file have the following "marks":

<w:commentRangeStart w:id="1"/>
<w:r>
<w:rPr/>
<w:t xml:space="preserve">
Five quacking zephyrs jolt my wax bed. Flummoxed by job, kvetching W. zaps Iraq. Cozy sphinx waves quart jug of bad milk. A very bad quack might jinx zippy fowls. Few quips galvanized the mock jury box. Quick brown dogs jump over the lazy fox. The jay, pig, fox, zebra, and my wolves quack! Blowzy red vixens fight for a quick jump. Joaquin Phoenix was gazed by MTV for luck.
</w:t>
</w:r>
<w:commentRangeEnd w:id="1"/>

With the following associated comments in the word/comments.xml file:

<w:comment w:id="1" w:author="Unknown Author" w:date="2018-04-05T13:58:02Z" w:initials="">
<w:p>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="Noto Sans CJK SC Regular" w:cs="FreeSans" w:ascii="Liberation Serif" w:hAnsi="Liberation Serif"/>
<w:b w:val="false"/>
<w:bCs w:val="false"/>
<w:i w:val="false"/>
<w:iCs w:val="false"/>
<w:caps w:val="false"/>
<w:smallCaps w:val="false"/>
<w:strike w:val="false"/>
<w:dstrike w:val="false"/>
<w:outline w:val="false"/>
<w:shadow w:val="false"/>
<w:emboss w:val="false"/>
<w:imprint w:val="false"/>
<w:color w:val="auto"/>
<w:spacing w:val="0"/>
<w:w w:val="100"/>
<w:position w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="24"/>
<w:u w:val="none"/>
<w:vertAlign w:val="baseline"/>
<w:em w:val="none"/>
<w:lang w:bidi="hi-IN" w:eastAsia="zh-CN" w:val="en-US"/>
</w:rPr>
<w:t>All paragraph.</w:t>
</w:r>
</w:p>
</w:comment>

These things seem linked by the w:id="1" in both word/document.xml and word/comments.xml files.

It would be very interesting if your docx_extract_all_cmnts() function returned a tibble containing a column with the text associated with each comment.

test.docx.zip
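A sketch of how that join could be done with xml2, purely illustrative (not current docxtractr behavior): for each comment id, collect the w:t nodes that fall between the matching commentRangeStart and commentRangeEnd marks.

```r
library(xml2)

# Illustrative sketch: pair each comment id with the document text enclosed
# by its w:commentRangeStart / w:commentRangeEnd marks.
comment_selected_text <- function(document_xml, comments_xml) {
  doc <- read_xml(document_xml)
  cm  <- read_xml(comments_xml)
  ids <- xml_attr(xml_find_all(cm, "//w:comment"), "w:id", ns = xml_ns(cm))
  vapply(ids, function(id) {
    xpath <- sprintf(
      "//w:t[preceding::w:commentRangeStart[@w:id='%s']
         and following::w:commentRangeEnd[@w:id='%s']]", id, id)
    paste(xml_text(xml_find_all(doc, xpath)), collapse = "")
  }, character(1))
}
```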

Tables with track changes badly read

Hi,
Thank you for a great package. I have a document with tables, and the document has track changes enabled. When I read it, the values in the table are not read correctly. Please see the attached file and the example below: it should read "2", not "21".
docxtractr_bug.docx

Thanks for looking into it!

Tomas


> library(docxtractr)
> path<-"C:\\Users\\tomas_hovorka\\Documents\\docxtractr_bug.docx"
> 
> d1<-read_docx(path)
> t1a<-docx_extract_tbl(d1, 1)
> t1a
# A tibble: 0 x 1
# ... with 1 variable: `21` <chr>
> 
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Czech_Czech Republic.1250  LC_CTYPE=Czech_Czech Republic.1250    LC_MONETARY=Czech_Czech Republic.1250 LC_NUMERIC=C                         
[5] LC_TIME=Czech_Czech Republic.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] docxtractr_0.5.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16          utf8_1.1.4            crayon_1.3.4          dplyr_0.7.4           assertthat_0.1        R6_2.2.2              magrittr_1.5         
 [8] pillar_1.2.3          httr_1.3.1            rlang_0.2.0           rstudioapi_0.7.0-9000 bindrcpp_0.2          xml2_1.2.0            tools_3.3.1          
[15] glue_1.2.0            purrr_0.2.2           pkgconfig_2.0.1       bindr_0.1.1           tibble_1.4.2         
> 

can't read from a local file

Hello,

Thanks for the great package. I'm having issues reading a .docx file from my workspace.
For example,

doc1 <- read_docx("myfile.docx")

This simple code doesn't work. I get:
Error: 'C:\Users\smithj\AppData\Local\Temp\1\RtmpqqbQyW/docdata/word/document.xml' does not exist.

I can read in from the examples like:
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))

I don't want to copy all my files to the package example directory. Maybe I'm doing something wrong? I tried to Google it but haven't had any success.
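For what it's worth, that error usually means unzip() failed quietly, i.e. the path did not resolve to a valid .docx (zip) archive. A few checks worth running before read_docx():

```r
path <- "myfile.docx"
file.exists(path)                # is the path right relative to getwd()?
normalizePath(path)              # which absolute path is R actually using?
utils::unzip(path, list = TRUE)  # a valid .docx lists word/document.xml
```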

Thanks,

Numbers are lost when reading cells with numbered lists

First of all, thanks for the amazing package!

I am trying to read the contents of a .docx table and am having issues with numbered lists. If I enter the numbers by hand, all is well, but if I use a numbered list, the numbers are lost when the table is extracted.

As you can see in the reproducible example below, row 2 (Items) is extracted fine ("1. First item\n2. Second item"), but in row 3 (Items2), the numbers are lost ("First item\nSecond item").

Screenshot from 2022-08-03 13-11-21

DOC = docxtractr::read_docx("https://github.com/gorkang/BUG_docxtractr/blob/master/test.docx?raw=true")
TABLE = docxtractr::docx_extract_tbl(DOC, preserve = TRUE, header = FALSE)

# Not using a numbered list. All is fine
TABLE$V2[2]
#> [1] "1. First item\n2. Second item"

# In row 3 we use a numbered list. Numbers are lost
TABLE$V2[3]
#> [1] "First item\nSecond item"

Thanks!

Created on 2022-08-03 by the reprex package (v2.0.1)
