sajari / docconv
Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text
License: MIT License
How can I set a parameter to remove page headers?
I use Go modules in my project and import code.sajari.com/docconv.
Then I run
go mod tidy
The output is:
code.sajari.com/docconv imports
github.com/otiai10/gosseract/v1/gosseract: module github.com/otiai10/gosseract@latest found (v2.2.1+incompatible), but
does not contain package github.com/otiai10/gosseract/v1/gosseract
Then I ran this command:
go list -m -versions github.com/otiai10/gosseract
It reports:
github.com/otiai10/gosseract v2.1.0+incompatible v2.2.0+incompatible v2.2.1+incompatible
I looked at the source code and found that the v2.0 tag contains the v1/gosseract code. Can anyone help me?
uname -a
Linux funky 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
go version
go version go1.14.3 linux/amd64
$ go get code.sajari.com/docconv/
go: finding code.sajari.com/docconv v1.0.0
go: downloading code.sajari.com/docconv v1.0.0
go: finding github.com/JalfResi/justext latest
go: downloading github.com/JalfResi/justext v0.0.0-20170829062021-c0282dea7198
go: finding github.com/levigross/exp-html latest
go: downloading github.com/levigross/exp-html v0.0.0-20120902181939-8df60c69a8f5
build code.sajari.com/docconv: cannot find module for path code.google.com/p/go.net/html
$
I know a little bit about computers. But I'm not a Go developer. Thus this:
Compile the binary. Check the binary is executable and then launch as per above with relevant flag settings.
is not all that helpful to me. I have tried:
mkdir docconv
cd docconv
GOPATH=/Users/pjlsergeant/dev/docconv go get https://github.com/sajari/docconv/issues/new
cd github.com/sajari/docconv/
GOPATH=/Users/pjlsergeant/dev/docconv go install
And I don't appear to have generated any binaries (and no errors, either).
I'm sure I will be able to Google and solve this eventually, but I wonder if the two or three lines of magic invocation needed could be included under Installation Instructions?
I can't seem to build this with the ocr tag. I'm getting this message:
[<username>@<machine> <dir>]$ go get -tags ocr code.sajari.com/docconv/...
go: finding module for package github.com/otiai10/gosseract/v1/gosseract
go: finding module for package github.com/otiai10/gosseract/v1/gosseract
<home_dir>/go/pkg/mod/code.sajari.com/[email protected]/image_ocr.go:10:2: no matching versions for query "latest"
I was under the impression gosseract was on v2, so I'm not sure why v1 is mentioned.
Docconv seems to have trouble running on Windows. In doc.go, for instance, there is a hardcoded path to a temp dir that includes a forward slash (e.g. lines 17 and 60). This is definitely a problem on Windows.
tidy returns blank, needs to be fixed.
Has the benefits of:
a) not needing tidy
b) not needing custom tokenization
c) provides the key image from the page also
wvHtml and wvText both return an OK response, but the output from Html2Text() removes virtually all of the content in some cases.
Is it possible to split the text by pages of the original document?
Example:
text from docconv -> Page1\n\n1\n\nMy Name\n is Page2\n\n2\n\n...
split text -> ["Page1", "My Name is Page2", "...", ...]
It's easier to test and can be used in other places if there is an option to just convert a file from the command line.
Is there a way to use this as an internal/embedded library in a Go program? I see that I can most likely achieve this with ODT and several other types since it returns a string.
However, with PDF it returns a BodyResult that is not exported and looks like it is interpreted directly by the command line? Am I missing something?
go: finding module for package github.com/otiai10/gosseract/v1/gosseract
Crawler/module/crawler imports
code.sajari.com/docconv imports
github.com/otiai10/gosseract/v1/gosseract: module github.com/otiai10/gosseract@latest found (v2.2.1+incompatible), but does not contain package
github.com/otiai10/gosseract/v1/gosseract
Hello,
Could anyone help me? How do I use this library on a Windows machine, given that the dependencies need to be installed?
Is there a tutorial? Thank you.
Hi team, are there any plans to support the popular reading format EPUB?
Image to text (using OCR); I recommend gosseract.
I tried to deploy it to App Engine, and the deployment fails.
---------------------------------------------------------------------------------------------------------------- REMOTE BUILD OUTPUT -----------------------------------------------------------------------------------------------------------------
starting build "b30eafdb-e2e2-4a5d-b334-6be6535a8773"
FETCHSOURCE
Fetching storage object: gs://staging.xxx.appspot.com/us.gcr.io/xxx/appengine/docd.1:latest#1571314487109426
Copying gs://staging.xxx.appspot.com/us.gcr.io/xxx/appengine/docd.1:latest#1571314487109426...
/ [1 files][ 6.4 MiB/ 6.4 MiB]
Operation completed over 1 objects/6.4 MiB.
BUILD
Already have image (with digest): gcr.io/cloud-builders/docker
Sending build context to Docker daemon 13.49MB
Step 1/9 : FROM alpine
latest: Pulling from library/alpine
Digest: sha256:acd3ca9941a85e8ed16515bfc5328e4e2f8c128caa72959a58a127b7801ee01f
Status: Downloaded newer image for alpine:latest
---> 961769676411
Step 2/9 : MAINTAINER Hamish Ogilvy
---> Running in 1bf83502be90
Removing intermediate container 1bf83502be90
---> 1d609d316173
Step 3/9 : ENV CC=/usr/bin/gcc
---> Running in ebc84ab28e30
Removing intermediate container ebc84ab28e30
---> 857461fa7c94
Step 4/9 : ENV CXX=/usr/bin/g++
---> Running in 7022ffdf3aa6
Removing intermediate container 7022ffdf3aa6
---> e6a37cfd4e07
Step 5/9 : COPY dependencies/* /
COPY failed: no source files were specified
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: exit status 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ERROR: (gcloud.app.deploy) Cloud build failed. Check logs at https://console.cloud.google.com/gcr/builds/xxx?project=xxx Failure status: UNKNOWN: Error Response: [2] Build failed; check build logs for details
rm: appengine/dependencies: No such file or directory
The Automatic Numbering produced from Word (docx) gets dropped when using ConvertPath.
Hi there! 😊
This repo seems to depend on github.com/JalfResi/justext, which in turn depends on github.com/levigross/exp-html, which doesn't ship a license file. This was identified in our CI pipeline using github.com/google/go-licenses.
To me this looks like a license violation and should be fixed somehow.
Reported at the github.com/JalfResi/justext repo here: JalfResi/justext#34
When I read a .doc file from a local directory, I get a Go runtime error.
Below is my code:
package main

import (
	"fmt"
	"log"
	"os"

	"code.sajari.com/docconv"
)

func main() {
	f, err := os.Open("C:\\Users\\94417\\Desktop\\go\\20210913test.doc")
	if err != nil {
		log.Fatalf("got error = %v, want nil", err)
	}
	// Close the file when main returns.
	defer f.Close()

	res, _, err := docconv.ConvertDoc(f)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
Currently files with bad extensions can fail parsing. We should try to detect the mimetype on failure.
Hello!
I would like to use the -layout option of pdftotext. For that, I guess I have to change
body, err := exec.Command("pdftotext", "-q", "-nopgbrk", "-enc", "UTF-8", "-eol", "unix", f.Name(), "-").Output()
to add -layout. Am I correct?
Can we easily support the .pages extension?
With current docconv at github/sajari/docconv that imports github.com/otiai10/gosseract/v1/gosseract the build on Mac OS terminal results in:
go get -tags ocr code.sajari.com/docconv/...
package github.com/otiai10/gosseract/v1/gosseract: cannot find package "github.com/otiai10/gosseract/v1/gosseract" in any of:
/usr/local/go/src/github.com/otiai10/gosseract/v1/gosseract (from $GOROOT)
/Users//go/src/github.com/otiai10/gosseract/v1/gosseract (from $GOPATH)
Changing the import to reference otiai10's current release (i.e. "github.com/otiai10/gosseract/v1/gosseract") in image_ocr.go results in undefined references as follows:
go get -tags ocr code.sajari.com/docconv/...
/Users//go/src/code.sajari.com/docconv/image_ocr.go:35:11: undefined: gosseract.Must
/Users//go/src/code.sajari.com/docconv/image_ocr.go:35:26: undefined: gosseract.Params
I installed all dependencies with yum install poppler-utils wv unrtf tidy
res, err := docconv.ConvertPath("/opt/test/测试.doc")
[root@izwz9594gzu20j2z7lneucz test]# ./main
2020/11/03 10:14:15 wvText: exec: "wvText": executable file not found in $PATH
2020/11/03 10:14:15 wvSummary: exec: "wvSummary": executable file not found in $PATH
2020/11/03 10:14:15 error converting data: error unzipping data: zip: not a valid zip file
Hello,
We were considering using your library as part of our application and discovered one potential license inconsistency:
Your library is licensed under MIT and has poppler-utils in the dependencies. However, poppler is licensed under GPL 2.
License information:
https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/README.md
https://pkgs.alpinelinux.org/package/edge/main/x86/poppler-utils
In our understanding, this could oblige your library to be licensed under GPL 2. I'm not a license expert and I might be mistaken here. But I hope you find this observation helpful; you might have already considered it, and there may be reasons why it's still fine to use MIT. It would be great if you could clarify this and explain to us the legal way to use your library and poppler in our app under MIT.
Thank you in advance!
Hi,
Very good work with this project, super useful!
Is it possible to add support for tesseract in the docker images?
Just asking to know if I directly fork it or just wait a bit
Thanks!
unrtf and tidy using fink install xx.
poppler and wv using brew install xx.
I am using the library on Linux without any problems. But when I compile the same code on Windows, the following error arises:
error converting data: exec: "pdftotext": executable file not found in %PATH%
Is there any way to install the dependencies in Windows?
I am using go version 1.17.
these tidy, wv, popplerutils, unrtf
Hi,
Is there any plan to support pptx files? Their format is very similar.
This is not an issue. How can I get the content of the document per page number? Thank you.
new hosting: https://github.com/rogpeppe/go-charset
$ go get github.com/sajari/docconv
warning: code.google.com is shutting down; import path code.google.com/p/go-charset/charset will stop working
Converting a document exported from Word 2007 to the 2003 format works correctly, but converting a Word 2003 document fails:
2020/11/03 11:02:46 wvText: exit status 255
2020/11/03 11:02:46 error converting data: error unzipping data: zip: not a valid zip file
Lines 102 and 103 in doc.go cause a deadlock, mainly because the goroutines started above fail to send valid data on the channels:
body := <-bc
meta := <-mc
err:
ConvertDoc: could not read doc: mscfb: bad signature; 43016997712
wvText: exit status 255
Generate pdf scanned (image inside of the pdf)
bootstrap: nuveo@8729c97
The issue seems to arise from rtf.go lines 31-33:
if len(line) > 4 && line[:4] != "### " {
output += line + "\n"
}
How can I convert docx to html?
For url.go to work, the file needs to look like this:
package docconv

import (
	"bytes"
	"io"

	goose "github.com/advancedlogic/GoOse"
)

// ConvertURL reads a URL from input, fetches it, and returns the
// extracted article text and metadata.
func ConvertURL(input io.Reader, readability bool) (string, map[string]string, error) {
	meta := make(map[string]string)

	buf := new(bytes.Buffer)
	_, err := buf.ReadFrom(input)
	if err != nil {
		return "", nil, err
	}

	g := goose.New()
	article, err := g.ExtractFromURL(buf.String()) // this is changed
	if err != nil {
		return "", nil, err
	}

	meta["title"] = article.Title
	meta["description"] = article.MetaDescription
	meta["image"] = article.TopImage

	return article.CleanedText, meta, nil
}
What do you think about using govendor as the default dependency manager for the project?
Right now, the only external dependency is justext, but in #26 I've added a new dependency to deal with xlsx files. In the future, more dependencies may be added to handle new file types, and this could be a good approach to dependency management.
I'm trying to run the simple code from the tutorial in this repo, only with my own PDF file, and I get this error: error converting data: exec: "pdftotext": executable file not found in $PATH.
Platform: macOS. My PDF file is in go/src/project and in go/bin.
My Go project file path: /User/admin/go/src/project
.bash_profile:
export GOPATH=$HOME/go
export GOBIN=$GOPATH/bin
export PATH=$PATH:/usr/local/go/bin
Code:
package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	res, err := docconv.ConvertPath("gsl-mit-edu-0to1.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
What could be the problem?
The following line throws an error message for me:
f, err := ioutil.TempFile(os.TempDir(), "/docconv")
error message:
pattern contains path separator
full error message thrown by docconv.Convert():
error converting data: error creating local file: error creating temporary file: pattern contains path separator
This can be easily reproduced with go playground, just add and remove the slash to see the difference:
https://play.golang.org/p/VaCO9evqKzM
I suppose it could be fixed by just removing the slash, unless I'm missing the reason it needed to be there in the first place.
So I was trying to parse content from multiple document formats, and it turns out it works for other formats (pdf, doc, etc.) but somehow not for HTML files.
Below is a minimal example with sample HTML.
main.go
package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	// Attempt to read file
	txt, err := docconv.ConvertPath("test.html")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(txt.Body)
}
test.html
<!DOCTYPE html>
<html>
<body>
<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6>
</body>
</html>
As of now, the output is blank.
Also, I noticed that there has been no release since February 2019, so code.sajari.com might be serving an older version of the library. Is there any way to publish a pre-release version, or to configure CI to do that?
The backup parsing works pretty well, except for random elements like <fb:comments> etc, so it's probably good enough.