icij / node-tika


137.0 11.0 36.0 295.18 MB

Apache Tika bridge for Node.js. Text and metadata extraction, language detection and more.

License: MIT License

Java 48.61% JavaScript 48.17% Makefile 3.22%


node-tika's Issues

unable to download java build files while installing the tika

While installing tika, I get a node-gyp rebuild error, which is strange. I changed the Python config and set the MSVS version to 2013, but I still cannot install tika. I also tried to install the Windows build tools, but was unable to install them with the command

Upgrade Tika

If I want to use the newest Tika version, do I just have to put another jar in the folder, or is there more work to do?

cannot find module tika

I just copied the code from your example and got the following error:

module.js:529
throw err;
^

Error: Cannot find module 'tika'
at Function.Module._resolveFilename (module.js:527:15)
at Function.Module._load (module.js:476:23)
at Module.require (module.js:568:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (/home/henrysachs/move-search/backend/node_approach/events.js:1:74)
at Module._compile (module.js:624:30)
at Object.Module._extensions..js (module.js:635:10)
at Module.load (module.js:545:32)
at tryModuleLoad (module.js:508:12)
at Function.Module._load (module.js:500:3)
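
A quick way to confirm whether Node can find the package from the script's directory is `require.resolve`, which throws an error with code `MODULE_NOT_FOUND` when the lookup fails. A minimal sketch; the helper name `canResolve` is made up for illustration and is not part of node-tika:

```javascript
// Hypothetical helper: returns true if Node can resolve `name` from this file.
function canResolve(name) {
  try {
    require.resolve(name);
    return true;
  } catch (err) {
    if (err.code === 'MODULE_NOT_FOUND') return false;
    throw err; // some other failure: surface it instead of hiding it
  }
}

if (!canResolve('tika')) {
  console.error('tika cannot be resolved here; try `npm install tika` in this project directory');
}
```

If this reports that the module cannot be resolved, the package was probably never installed in (or above) the directory that contains the script.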

Java dependency upgrade to version 0.6.0

With node v4.1.0 this package crashed because of node-java.
Check: joeferner/node-java#250

Small issue in documentation

According to documentation:

tika.text('http://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf', 
    function(err, text, meta) {
        // ...
    });

should work. However, the documentation later states that you need to call tika.extract to get the metadata as well; tika.text passes only the extracted text to the callback (confirmed by experiment :) ).
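
A minimal sketch of the difference, assuming (as this issue describes) that `tika.extract(uri, callback)` calls back with `(err, text, meta)`. The wrapper name `extractTextAndMeta` is hypothetical, and the `tika` object is passed in so the sketch makes no assumption about how the module was loaded:

```javascript
// Sketch only: wraps the callback-style extract call in a Promise.
// Assumes tika.extract(uri, cb) invokes cb(err, text, meta), per the issue above.
function extractTextAndMeta(tika, uri) {
  return new Promise((resolve, reject) => {
    tika.extract(uri, (err, text, meta) => {
      if (err) return reject(err);
      resolve({ text, meta }); // both values, unlike tika.text which yields text only
    });
  });
}

// Usage (needs the real package installed):
// const tika = require('tika');
// extractTextAndMeta(tika, 'path/to/file.pdf')
//   .then(({ text, meta }) => console.log(meta, text.length));
```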

Module did not self-register

I'm attempting to use node-tika in a project that builds/tests via CircleCI. Their CI environment installs various things for me, but when my server attempts to start it fails as follows:

[17:54:45] Using gulpfile ~/ow-back/gulpfile.js
[17:54:45] Starting 'syncdb'...
[17:54:45] Finished 'syncdb' after 288 ms
[17:54:45] Starting 'serve'...
[17:54:46] 'serve' errored after 240 ms
[17:54:46] Error: Module did not self-register.
    at Error (native)
    at Object.Module._extensions..node (module.js:450:18)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:313:12)
    at Module.require (module.js:366:17)
    at require (module.js:385:17)
    at Object.<anonymous> (/home/ubuntu/ow-back/node_modules/tika/node_modules/java/lib/nodeJavaBridge.js:10:16)
    at Module._compile (module.js:425:26)
    at Object.Module._extensions..js (module.js:432:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:313:12)
    at Module.require (module.js:366:17)
    at require (module.js:385:17)
    at Object.<anonymous> (/home/ubuntu/ow-back/node_modules/tika/node_modules/java/index.js:2:18)
    at Module._compile (module.js:425:26)

CircleCI appears to be running Ubuntu, with JAVA_HOME=/usr/lib/jvm/jdk1.7.0, to give you a sense of my Java version.

It appears this is some kind of java incompatibility issue, but what are the troubleshooting steps associated with fixing this?

Question: Is there a way to get text without placeholders?

I am trying to parse an HTML document so that it can be indexed into Elasticsearch. The text comes back with placeholders like [image:] etc. Is there an option to get the text back without these placeholders?
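
One possible workaround is to strip the placeholders after extraction. This is a post-processing sketch, not a Tika or node-tika option. Only the `[image:]` form appears in this issue; handling text inside the brackets (for example `[image: logo.png]`) is an assumption, and the function name is hypothetical:

```javascript
// Post-processing sketch: removes bracketed image placeholders from extracted text.
function stripImagePlaceholders(text) {
  return text
    .replace(/\[image:[^\]]*\]/g, '') // drop "[image:]" and "[image: ...]"
    .replace(/[ \t]{2,}/g, ' ');      // collapse the doubled spaces left behind
}
```

Other placeholder types would need their own patterns, so this only helps if the set of placeholders in the documents is known.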

Tika 1.9

Version 1.9 is already released. 👍

cannot extract text from scanned PDF

I am trying to extract text from scanned PDF documents. It works fine for most of them, except for a couple I tested.
For those, I can extract the metadata correctly but not the text in the PDF; the text part comes back as a set of blank lines.
Are there specific PDF versions or other criteria that can cause this issue? Does it have anything to do with the PDF producer, which in this case is Haru Free PDF Library 2.0.8?

Inconsistent naming of PDF options

The README lists PDF options with a pdf prefix; however, the fillPdfOptions() method looks for options without the prefix. I would be happy to make a PR; I just need to know whether you want to update the README or the option parser.

Thanks!
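
Until the naming is settled, a caller could pass both spellings. A minimal sketch under stated assumptions: the helper `withUnprefixedPdfOptions` is hypothetical, and the option name in the usage comment is only illustrative, not taken from the README or from `fillPdfOptions()`:

```javascript
// Hypothetical helper: for every "pdfFooBar" key, add a "fooBar" copy
// unless the caller already supplied one.
function withUnprefixedPdfOptions(options) {
  const result = Object.assign({}, options);
  for (const key of Object.keys(options)) {
    const match = /^pdf([A-Z])(.*)$/.exec(key);
    if (match) {
      const bare = match[1].toLowerCase() + match[2];
      if (!(bare in result)) result[bare] = options[key];
    }
  }
  return result;
}

// Illustrative usage; `pdfSortByPosition` is an assumed option name:
// tika.text(uri, withUnprefixedPdfOptions({ pdfSortByPosition: true }), callback);
```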

Return XHTML content as well (the Tika default)

It would be awesome if the bridge could deliver not only plain text but also the XHTML that Tika generates in its default configuration :-)

HTML Extraction

Hi there-

I'd like to use the BoilerPipeContentHandler to extract only the body text from an HTML page. Can anyone suggest a way to make this happen? I don't know much Java, so I'm not sure where to even start.

http://stackoverflow.com/questions/23653061/how-to-extract-main-text-from-html-using-tika

Thanks!
Alex

Clarification: Does node-tika require JDK 1.7 or above, or only JDK 1.7?

I am on Mac OS X Sierra. I tried using the "jabba" jvm version manager so I could map to 1.7 before building node-tika, but got errors.

Now the npm install worked, but I get the message in the screen shot I have included.

Suggestions?

Module version mismatch - Electron app

I'm trying to get Tika module to work in an electron app, but as soon as I require the module, I get this error:
Uncaught Error: Module version mismatch. Expected 47, got 46.
I am using nodeVersion v4.2.1 and npmVersion 2.14.7
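
The two numbers in that message are native-addon ABI versions: the runtime loading the addon expects 47, while the addon was compiled for 46. `process.versions.modules` reports the ABI version of whichever runtime executes the script, so logging it under both Node and Electron shows which one matches. A small sketch; the parsing helper is hypothetical:

```javascript
// Hypothetical helper: pulls the two ABI numbers out of the error message.
function parseAbiMismatch(message) {
  const match = /Expected (\d+), got (\d+)/.exec(message);
  if (!match) return null;
  return { expected: Number(match[1]), got: Number(match[2]) };
}

const info = parseAbiMismatch('Module version mismatch. Expected 47, got 46.');
// ABI version of the runtime executing this script (a string such as "46").
console.log('runtime ABI:', process.versions.modules, 'mismatch:', info);
```

A mismatch like this usually means the native dependency was built against plain Node rather than against Electron's headers, so rebuilding native modules for the Electron version in use is the common remedy.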

Need help making tika work in AWS Lambda

Our scenario is to take .pdf files uploaded to AWS S3 storage and process them later. We want to move to AWS Lambda. However, Lambda requires that the entire package (along with all node_modules) be uploaded as a zip file (i.e. it won't run npm install). This means that tika picks up whatever Java path the local machine happened to have and saves it in jvm_dll_path.json. The path to libjvm.so is different on the Lambda machine, and loading the module fails with "libjvm.so: cannot open shared object file: No such file or directory".
I tried just replacing the string in jvm_dll_path.json with the correct AWS path, but no dice.
Really appreciate any help to make this work on Lambda.
Thanks!
