sajari / docconv

Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text

License: MIT License

Go 98.43% Dockerfile 0.87% Shell 0.70%
rtf docx xml html rtf-files docs conversion pdf pdf-converter word

docconv's People

Contributors

agcom, dependabot[bot], dhowden, guilhermebr, helenamariano, jkaho, jonathaningram, jsok, jupenur, justinkoke, marioidival, mish15, onemartini, pixge, rhaist, senekor, testwill, xiaoxin01

docconv's Issues

use go mod support

I use go mod in my project and import code.sajari.com/docconv.
Then I run:

 go mod tidy 

The output is:

  code.sajari.com/docconv imports
        github.com/otiai10/gosseract/v1/gosseract: module github.com/otiai10/gosseract@latest found (v2.2.1+incompatible), but does not contain package github.com/otiai10/gosseract/v1/gosseract

Then I run this command:

go list  -m -versions github.com/otiai10/gosseract

It tells me:

github.com/otiai10/gosseract v2.1.0+incompatible v2.2.0+incompatible v2.2.1+incompatible

I looked at the source code and found that the v2.0 tag contains the v1/gosseract code. Can anyone help me?
uname -a

Linux funky 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

go version

go version go1.14.3 linux/amd64

adding dependency to docconv on go 1.11 fails

$ go get code.sajari.com/docconv/
go: finding code.sajari.com/docconv v1.0.0
go: downloading code.sajari.com/docconv v1.0.0
go: finding github.com/JalfResi/justext latest
go: downloading github.com/JalfResi/justext v0.0.0-20170829062021-c0282dea7198
go: finding github.com/levigross/exp-html latest
go: downloading github.com/levigross/exp-html v0.0.0-20120902181939-8df60c69a8f5
build code.sajari.com/docconv: cannot find module for path code.google.com/p/go.net/html
$

Improve installation instructions for non-Go programmers

I know a little bit about computers. But I'm not a Go developer. Thus this:

Compile the binary. Check the binary is executable and then launch as per above with relevant flag settings.

is not all that helpful to me. I have tried:

mkdir docconv
cd docconv
GOPATH=/Users/pjlsergeant/dev/docconv go get https://github.com/sajari/docconv/issues/new
cd github.com/sajari/docconv/
GOPATH=/Users/pjlsergeant/dev/docconv go install

And I don't appear to have generated any binaries (and no errors, either).

I'm sure I will be able to Google and solve this eventually, but I wonder if the two or three lines of magic invocation needed could be included under Installation Instructions?

Can't build with ocr tag

I can't seem to build this with the ocr tag. I'm getting this message:

[<username>@<machine> <dir>]$ go get -tags ocr code.sajari.com/docconv/...
go: finding module for package github.com/otiai10/gosseract/v1/gosseract
go: finding module for package github.com/otiai10/gosseract/v1/gosseract
<home_dir>/go/pkg/mod/code.sajari.com/[email protected]/image_ocr.go:10:2: no matching versions for query "latest"

I was under the impression gosseract was on v2, so I'm not sure why v1 is mentioned.

Compatibility with Windows

Docconv seems to cause trouble when running on a Windows computer. In doc.go, for instance, there is a hardcoded path to a temp dir that includes a forward slash (e.g. lines 17 and 60). This is definitely a problem on Windows...

Split text by pages

Is it possible to split the text by the pages of the original document?

Example:

text from docconv -> Page1\n\n1\n\nMy Name\n is Page2\n\n2\n\n...
split text -> ["Page1", "My Name is Page2", "...", ...]
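One possible approach, sketched here under assumptions: pdftotext separates pages with a form feed character ("\f") unless it is run with -nopgbrk. The pdftotext invocation quoted elsewhere on this page passes -nopgbrk, so the sketch assumes the text was produced without that flag. SplitPages is a hypothetical helper, not part of docconv.

```go
package main

import (
	"fmt"
	"strings"
)

// SplitPages is a hypothetical helper (not part of docconv). It splits text
// on form feed characters, trims surrounding whitespace from each page, and
// drops pages that are empty (for example after a trailing form feed).
func SplitPages(text string) []string {
	var pages []string
	for _, p := range strings.Split(text, "\f") {
		p = strings.TrimSpace(p)
		if p != "" {
			pages = append(pages, p)
		}
	}
	return pages
}

func main() {
	// Assumed input: text that still contains form feeds between pages.
	text := "Page1\n\n1\n\fMy Name\n is Page2\n\n2\n\f"
	for i, p := range SplitPages(text) {
		fmt.Printf("page %d: %q\n", i+1, p)
	}
}
```

Removing page numbers or joining wrapped lines inside each page would need extra, document-specific handling that this sketch does not attempt.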

Use as internal go library

Is there a way to use this as an internal/embedded library in a Go program? I see that I can most likely achieve this with ODT and several other types since it returns a string.

However, with PDF it returns a BodyResult that is not exported and looks like it is interpreted directly by the command line? Am I missing something?

Error occurred while running the command "go mod tidy"

go: finding module for package github.com/otiai10/gosseract/v1/gosseract
Crawler/module/crawler imports
        code.sajari.com/docconv imports
        github.com/otiai10/gosseract/v1/gosseract: module github.com/otiai10/gosseract@latest found (v2.2.1+incompatible), but does not contain package github.com/otiai10/gosseract/v1/gosseract

Installation on Windows

Hello,

Could anyone help me? How do I use this library on a Windows machine, given that it needs its dependencies installed?
Is there a tutorial? Thank you.

issues with deploying to gcloud

I tried to deploy it to App Engine, but the deployment fails.


---------------------------------------------------------------------------------------------------------------- REMOTE BUILD OUTPUT -----------------------------------------------------------------------------------------------------------------
starting build "b30eafdb-e2e2-4a5d-b334-6be6535a8773"

FETCHSOURCE
Fetching storage object: gs://staging.xxx.appspot.com/us.gcr.io/xxx/appengine/docd.1:latest#1571314487109426
Copying gs://staging.xxx.appspot.com/us.gcr.io/xxx/appengine/docd.1:latest#1571314487109426...
/ [1 files][  6.4 MiB/  6.4 MiB]
Operation completed over 1 objects/6.4 MiB.
BUILD
Already have image (with digest): gcr.io/cloud-builders/docker
Sending build context to Docker daemon  13.49MB
Step 1/9 : FROM alpine
latest: Pulling from library/alpine
Digest: sha256:acd3ca9941a85e8ed16515bfc5328e4e2f8c128caa72959a58a127b7801ee01f
Status: Downloaded newer image for alpine:latest
 ---> 961769676411
Step 2/9 : MAINTAINER Hamish Ogilvy
 ---> Running in 1bf83502be90
Removing intermediate container 1bf83502be90
 ---> 1d609d316173
Step 3/9 : ENV CC=/usr/bin/gcc
 ---> Running in ebc84ab28e30
Removing intermediate container ebc84ab28e30
 ---> 857461fa7c94
Step 4/9 : ENV CXX=/usr/bin/g++
 ---> Running in 7022ffdf3aa6
Removing intermediate container 7022ffdf3aa6
 ---> e6a37cfd4e07
Step 5/9 : COPY dependencies/* /
COPY failed: no source files were specified
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: exit status 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

ERROR: (gcloud.app.deploy) Cloud build failed. Check logs at https://console.cloud.google.com/gcr/builds/xxx?project=xxx Failure status: UNKNOWN: Error Response: [2] Build failed; check build logs for details
rm: appengine/dependencies: No such file or directory

Unable to parse .doc file

When I read a .doc file from a local directory, I get a Go runtime error.

Below is my code:

package main

import (
	"fmt"
	"log"
	"os"

	"code.sajari.com/docconv"
)

func main() {
	f, err := os.Open("C:\\Users\\94417\\Desktop\\go\\20210913test.doc")
	if err != nil {
		log.Fatalf("got error = %v, want nil", err)
	}
	defer f.Close()

	// The second return value is ignored here.
	res, _, err := docconv.ConvertDoc(f)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

pdftotext -layout

Hello!
I would like to use the -layout option of pdftotext. For that, I guess I have to change
body, err := exec.Command("pdftotext", "-q", "-nopgbrk", "-enc", "UTF-8", "-eol", "unix", f.Name(), "-").Output()
to add -layout. Am I correct?
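As a rough sketch of that change, the argument list could be built so that -layout is optional. The flags are the ones from the line quoted above plus pdftotext's -layout option; pdftotextArgs and the file name input.pdf are hypothetical and not part of docconv.

```go
package main

import (
	"fmt"
	"os/exec"
)

// pdftotextArgs is a hypothetical helper (not part of docconv). It returns
// the pdftotext arguments quoted in the issue, optionally adding -layout,
// followed by the input path and "-" so the text goes to standard output.
func pdftotextArgs(path string, layout bool) []string {
	args := []string{"-q", "-nopgbrk", "-enc", "UTF-8", "-eol", "unix"}
	if layout {
		args = append(args, "-layout")
	}
	return append(args, path, "-")
}

func main() {
	// Only builds the command; call cmd.Output() to actually run pdftotext.
	cmd := exec.Command("pdftotext", pdftotextArgs("input.pdf", true)...)
	fmt.Println(cmd.Args)
}
```

Options have to come before the input file name, which is why -layout is appended before the path rather than at the end.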

docconv appears to be out of sync with latest github.com/otiai10/gosseract

With the current docconv at github.com/sajari/docconv, which imports github.com/otiai10/gosseract/v1/gosseract, the build in a macOS terminal results in:

go get -tags ocr code.sajari.com/docconv/...
package github.com/otiai10/gosseract/v1/gosseract: cannot find package "github.com/otiai10/gosseract/v1/gosseract" in any of:
/usr/local/go/src/github.com/otiai10/gosseract/v1/gosseract (from $GOROOT)
/Users//go/src/github.com/otiai10/gosseract/v1/gosseract (from $GOPATH)

Changing the import to reference otiai10's current release (i.e. "github.com/otiai10/gosseract/v1/gosseract") in image_ocr.go results in undefined references as follows:

go get -tags ocr code.sajari.com/docconv/...

code.sajari.com/docconv

/Users//go/src/code.sajari.com/docconv/image_ocr.go:35:11: undefined: gosseract.Must
/Users//go/src/code.sajari.com/docconv/image_ocr.go:35:26: undefined: gosseract.Params

wvText: exec: "wvText": executable file not found in $PATH

All dependencies were installed with: yum install poppler-utils wv unrtf tidy

res, err := docconv.ConvertPath("/opt/test/测试.doc")

[root@izwz9594gzu20j2z7lneucz test]# ./main
2020/11/03 10:14:15 wvText: exec: "wvText": executable file not found in $PATH
2020/11/03 10:14:15 wvSummary: exec: "wvSummary": executable file not found in $PATH
2020/11/03 10:14:15 error converting data: error unzipping data: zip: not a valid zip file
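Errors like these mean the external programs are not on the PATH of the process that runs the Go binary. A minimal sketch for checking this up front with exec.LookPath from the standard library follows. The tool list is an assumption pieced together from the messages and package names in this thread, and missingTools is a hypothetical helper, not part of docconv.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools is a hypothetical helper (not part of docconv). It returns
// the names in tools that exec.LookPath cannot resolve on the current PATH.
func missingTools(tools []string) []string {
	var missing []string
	for _, t := range tools {
		if _, err := exec.LookPath(t); err != nil {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	// Assumed list of external programs, taken from the errors and the
	// yum package names mentioned above.
	tools := []string{"pdftotext", "wvText", "wvSummary", "unrtf", "tidy"}
	if m := missingTools(tools); len(m) > 0 {
		fmt.Println("missing external tools:", m)
		return
	}
	fmt.Println("all external tools found")
}
```

Running this as the same user and in the same environment as the failing program shows whether the problem is a missing install or a PATH that differs between shells.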

Possible license inconsistencies

Hello,

We were considering using your library as part of our application and discovered one potential license inconsistency:

Your library is licensed under MIT and has poppler-utils in the dependencies. However, poppler is licensed under GPL 2.

License information:
https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/README.md
https://pkgs.alpinelinux.org/package/edge/main/x86/poppler-utils

In our understanding, this could oblige your library to be licensed under GPL 2. I'm not a license expert and I might be mistaken here. But I hope you find this observation helpful. You might have already considered it, and there may be reasons why it's still fine to use MIT. It would be great if you could clarify this and explain to us the legal way to use your library and poppler in our app under MIT.

Thank you in advance!

Add tesseract support in docker images

Hi,

Very good work with this project, super useful!

Is it possible to add support for tesseract in the docker images?

Just asking so I know whether to fork it directly or wait a bit.

Thanks!

using the library in windows platform

I am using the library in Linux without any problems. But when I compile the same code in Windows, the following error arises:

error converting data: exec: "pdftotext": executable file not found in %PATH%

Is there any way to install the dependencies in Windows?

I am using go version 1.17.

wvText: exit status 255

Converting a Word document exported from 2007 to 2003 format works, but converting a native 2003 document fails with this error:

2020/11/03 11:02:46 wvText: exit status 255
2020/11/03 11:02:46 error converting data: error unzipping data: zip: not a valid zip file

system deadlock

Lines 102 and 103 in doc.go cause a deadlock, mainly because the goroutines started above fail to send any data on the channels:

	body := <-bc
	meta := <-mc

err:

ConvertDoc: could not read doc: mscfb: bad signature; 43016997712
wvText: exit status 255
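A common way to avoid this kind of blocked receive is to make every goroutine send exactly one value, carrying either the result or the error. The sketch below is not the actual doc.go code. The names result, run and convert are hypothetical, and the failing step is simulated with the error text from the report.

```go
package main

import (
	"errors"
	"fmt"
)

// result carries either the text produced by a step or the error it hit.
type result struct {
	text string
	err  error
}

// run starts fn in a goroutine and returns a channel that always receives
// exactly one result, so the caller's receive cannot block forever.
func run(fn func() (string, error)) <-chan result {
	ch := make(chan result, 1) // buffered, so the send never blocks
	go func() {
		text, err := fn()
		ch <- result{text: text, err: err}
	}()
	return ch
}

// convert simulates a body step that fails and a metadata step that works.
func convert() (string, error) {
	bc := run(func() (string, error) {
		return "", errors.New("wvText: exit status 255")
	})
	mc := run(func() (string, error) {
		return "meta", nil
	})

	body := <-bc
	meta := <-mc
	if body.err != nil {
		return "", body.err
	}
	if meta.err != nil {
		return "", meta.err
	}
	return body.text, nil
}

func main() {
	_, err := convert()
	fmt.Println("error:", err)
}
```

Because the error travels through the channel, the caller gets it back as a normal return value instead of hanging.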

docx

How can I convert docx to html?

"github.com/advancedlogic/GoOse" has changed ExtractFromURL function, update needed

The url.go file needs to look like this to work:

package docconv

import (
	"bytes"
	"io"

	"github.com/advancedlogic/GoOse"
)

// Convert URL
func ConvertURL(input io.Reader, readability bool) (string, map[string]string, error) {
	meta := make(map[string]string)

	buf := new(bytes.Buffer)
	_, err := buf.ReadFrom(input)
	if err != nil {
		return "", nil, err
	}

	g := goose.New()
	article, err := g.ExtractFromURL(buf.String()) // <-- this is changed
	if err != nil {
		return "", nil, err
	}

	meta["title"] = article.Title
	meta["description"] = article.MetaDescription
	meta["image"] = article.TopImage

	return article.CleanedText, meta, nil
}

Add govendor as dependency manager

What do you think about using govendor as default dependency manager for the project?

Right now, the only external dependency existing is justext, but in #26 I've added a new dependency to deal with xlsx files, in the future, new dependencies may be added for dealing with new file types, and this could be a good approach for handling the dependency management.

error converting data: exec: "pdftotext": executable file not found in $PATH

I'm trying to run the simple example from the tutorial in this repo, only with my own PDF file, and I get this error: error converting data: exec: "pdftotext": executable file not found in $PATH.

Platform: macOS. My PDF file is in go/src/project and in go/bin.

My Go project file path: /User/admin/go/src/project

.bash_profile:

export GOPATH=$HOME/go
export GOBIN=$GOPATH/bin
export PATH=$PATH:/usr/local/go/bin

Code:

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	res, err := docconv.ConvertPath("gsl-mit-edu-0to1.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

What could be the problem ?

Fix temporary file creation

The following line throws an error message for me:
f, err := ioutil.TempFile(os.TempDir(), "/docconv")

error message:
pattern contains path separator

full error message thrown by docconv.Convert():
error converting data: error creating local file: error creating temporary file: pattern contains path separator

This can be easily reproduced with go playground, just add and remove the slash to see the difference:
https://play.golang.org/p/VaCO9evqKzM

I suppose it could be fixed by just removing the slash, unless I'm missing the reason it needed to be there in the first place.
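The difference can also be reproduced locally with a short program. This sketch uses os.CreateTemp (Go 1.16 and later), which applies the same pattern check as ioutil.TempFile. tryTemp is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"os"
)

// tryTemp is a hypothetical helper. It creates a temporary file in the
// default temp directory using pattern, then closes and removes it again.
// It returns any error from creating or removing the file.
func tryTemp(pattern string) error {
	f, err := os.CreateTemp("", pattern)
	if err != nil {
		return err
	}
	name := f.Name()
	f.Close()
	return os.Remove(name)
}

func main() {
	// A pattern with a path separator is rejected.
	fmt.Println("with slash:   ", tryTemp("/docconv"))
	// The same pattern without the slash is accepted.
	fmt.Println("without slash:", tryTemp("docconv"))
}
```

Passing an empty directory makes os.CreateTemp use os.TempDir(), so no separator needs to be spelled out by hand.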

Unable to parse HTML

I was trying to parse content from multiple document formats. It works for other formats (pdf, doc, etc.) but somehow not for html files.

Below is a minimal example with sample html:

main.go

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	// Attempt to read file
	txt, err := docconv.ConvertPath("test.html")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(txt.Body)
}

test.html

<!DOCTYPE html>
<html>
  <body>
    <h1>This is heading 1</h1>
    <h2>This is heading 2</h2>
    <h3>This is heading 3</h3>
    <h4>This is heading 4</h4>
    <h5>This is heading 5</h5>
    <h6>This is heading 6</h6>
  </body>
</html>

As of now, the output is blank.

Also, I noticed that there has been no release since February 2019, so code.sajari.com might be serving an older version of the library. Is there any way to publish a pre-release version, or to configure CI to do that?
