icij / node-tika
Apache Tika bridge for Node.js. Text and metadata extraction, language detection and more.
License: MIT License
While installing tika, I get a node-gyp rebuild error, which is strange. I changed the Python configuration and set the MSVS version to 2013, but I still cannot install tika. I also tried to install the Windows build tools, but could not install them with the command
If I want to use the newest Tika version, do I just have to put another jar in the folder, or is there more work to do?
I just copied the code from your example and get the following error:
module.js:529
throw err;
^
Error: Cannot find module 'tika'
at Function.Module._resolveFilename (module.js:527:15)
at Function.Module._load (module.js:476:23)
at Module.require (module.js:568:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (/home/henrysachs/move-search/backend/node_approach/events.js:1:74)
at Module._compile (module.js:624:30)
at Object.Module._extensions..js (module.js:635:10)
at Module.load (module.js:545:32)
at tryModuleLoad (module.js:508:12)
at Function.Module._load (module.js:500:3)
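This error typically means that Node cannot resolve a `tika` package from the script's directory, for instance because `npm install tika` was never run there or did not complete. A minimal sketch for failing with a clearer message, relying only on Node's standard `MODULE_NOT_FOUND` error code; the helper name `tryRequire` is hypothetical:

```javascript
// Hypothetical helper: load a module if it can be resolved, otherwise return null.
function tryRequire(name) {
  try {
    return require(name);
  } catch (err) {
    // Node sets err.code to 'MODULE_NOT_FOUND' when resolution fails.
    if (err && err.code === 'MODULE_NOT_FOUND') {
      return null;
    }
    // Any other failure (for example a broken native build) should still surface.
    throw err;
  }
}

const tika = tryRequire('tika');
if (!tika) {
  console.error("Module 'tika' not found; run `npm install tika` in this project first.");
}
```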
With node v4.1.0 this package crashed because of node-java.
Check: joeferner/node-java#250
According to documentation:
tika.text('http://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf',
function(err, text, meta) {
// ...
});
should work. However, the documentation later states that you need to call tika.extract to also get the metadata; tika.text passes only the extracted text to the callback (confirmed by an experiment :) ).
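The difference reported here can be sketched as follows. This assumes, as the report describes, that `tika.extract` calls back with `(err, text, meta)` while `tika.text` calls back with `(err, text)` only. The wrapper name `extractWithMeta` and the stub object are hypothetical; the module is passed in as an argument so the sketch runs without a JVM.

```javascript
// Hypothetical wrapper: always go through extract() when metadata is needed.
function extractWithMeta(tika, uri, callback) {
  tika.extract(uri, function (err, text, meta) {
    if (err) {
      return callback(err);
    }
    // Hand back both pieces together; fall back to an empty object if meta is missing.
    callback(null, { text: text, meta: meta || {} });
  });
}

// Stand-in for the real module, for illustration only.
const stubTika = {
  extract: function (uri, cb) {
    cb(null, 'sample text', { 'Content-Type': ['application/pdf'] });
  }
};

extractWithMeta(stubTika, 'eng.pdf', function (err, result) {
  if (err) throw err;
  console.log(result.text, Object.keys(result.meta));
});
```

With the real module, `require('tika')` would be passed as the first argument instead of the stub.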
I'm attempting to use node-tika in a project that builds/tests via CircleCI. Their CI environment installs various things for me, but when my server attempts to start it fails as follows:
[17:54:45] Using gulpfile ~/ow-back/gulpfile.js
[17:54:45] Starting 'syncdb'...
[17:54:45] Finished 'syncdb' after 288 ms
[17:54:45] Starting 'serve'...
[17:54:46] 'serve' errored after 240 ms
[17:54:46] Error: Module did not self-register.
at Error (native)
at Object.Module._extensions..node (module.js:450:18)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:313:12)
at Module.require (module.js:366:17)
at require (module.js:385:17)
at Object.<anonymous> (/home/ubuntu/ow-back/node_modules/tika/node_modules/java/lib/nodeJavaBridge.js:10:16)
at Module._compile (module.js:425:26)
at Object.Module._extensions..js (module.js:432:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:313:12)
at Module.require (module.js:366:17)
at require (module.js:385:17)
at Object.<anonymous> (/home/ubuntu/ow-back/node_modules/tika/node_modules/java/index.js:2:18)
at Module._compile (module.js:425:26)
CircleCI appears to be running Ubuntu, with JAVA_HOME=/usr/lib/jvm/jdk1.7.0, to give you a sense of my Java version.
It appears this is some kind of Java incompatibility issue, but what are the troubleshooting steps for fixing it?
I am trying to parse an HTML document so that it can be indexed into Elasticsearch. The text comes back with placeholders like [image:] etc. Is there an option to get the text back without these placeholders?
Version 1.9 is already released.
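One possible workaround, independent of any Tika option, is to post-process the extracted text before indexing. A minimal sketch, assuming the placeholders always take the bracketed `[image: ...]` form shown in the report; the function name is hypothetical, and the pattern may need adjusting for other placeholder types:

```javascript
// Hypothetical post-processing step applied to text returned by the extractor.
function stripImagePlaceholders(text) {
  return text
    .replace(/\[image:[^\]]*\]/g, '') // drop "[image:]" and "[image: name]"
    .replace(/[ \t]{2,}/g, ' ');      // collapse the gaps the removal leaves behind
}

console.log(stripImagePlaceholders('Intro [image: logo.png] body [image:] end'));
```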
I am trying to extract text from scanned pdf documents. It works fine for most of them except a couple I tested.
I am able to extract the metadata correctly but not the text in the pdf. It returns with a blank set of lines for the text part.
Are there any specific pdf versions or some other criteria that can cause this issue? Does it have anything to do with the pdf producer which in this case is Haru Free PDF Library 2.0.8?
The README lists PDF options with a pdf prefix; however, the fillPdfOptions() method looks for options without the prefix. I would be happy to make a PR; I just need to know whether you want to update the README or the option parser.
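Until the README and the parser agree, a caller could accept both spellings. A minimal sketch, assuming the two forms differ only by a leading `pdf` prefix followed by an upper-case letter; `normalizePdfOptions` and the sample option names are hypothetical and not part of node-tika:

```javascript
// Hypothetical helper: map "pdfSomething" keys to "something" before passing options on.
function normalizePdfOptions(options) {
  const normalized = {};
  Object.keys(options).forEach(function (key) {
    const match = /^pdf([A-Z])(.*)$/.exec(key);
    // "pdfSortByPosition" -> "sortByPosition"; other keys pass through untouched.
    const name = match ? match[1].toLowerCase() + match[2] : key;
    normalized[name] = options[key];
  });
  return normalized;
}

console.log(normalizePdfOptions({ pdfSortByPosition: true, outputEncoding: 'UTF-8' }));
```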
Thanks!
It would be awesome if the bridge could deliver not only plain text but also the XHTML that Tika's default configuration can generate :-)
Hi there-
I'd like to use the BoilerPipeContentHandler to extract only the body text from an HTML page. Can anyone suggest a way to make this happen? I don't know much Java, so I'm not sure where to even start.
http://stackoverflow.com/questions/23653061/how-to-extract-main-text-from-html-using-tika
Thanks!
Alex
I'm trying to get Tika module to work in an electron app, but as soon as I require the module, I get this error:
Uncaught Error: Module version mismatch. Expected 47, got 46.
I am using nodeVersion v4.2.1 and npmVersion 2.14.7.
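A mismatch like this indicates that the native addon was compiled against a different Node ABI version than the runtime that is loading it. A minimal sketch for checking which ABI the current runtime expects, using Node's built-in `process.versions`; the function name is hypothetical. Running it once with plain `node` and once inside the Electron app lets you compare the `modules` values:

```javascript
// Hypothetical diagnostic: report the versions relevant to native addon loading.
function describeRuntime() {
  return {
    node: process.versions.node,
    modules: process.versions.modules, // ABI version that native addons must match
    electron: process.versions.electron || null // only set when running under Electron
  };
}

console.log(describeRuntime());
```

If the two values differ, the addon generally has to be rebuilt against the Electron headers rather than the system Node headers.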
Our scenario is to get .pdf files uploaded to AWS S3 storage and process them later. We want to move to AWS Lambda. However, Lambda requires that the entire package (along with all node_modules) be uploaded as a zip file (i.e. it won't run npm install). This means that tika picks up whatever Java path the local machine happened to have and saves it in jvm_dll_path.json. The path to libjvm.so is different on the Lambda machine, and loading the module fails with "libjvm.so: cannot open shared object file: No such file or directory".
I tried just replacing the string in jvm_dll_path.json with the correct AWS path, but no dice.
Really appreciate any help to make this work on Lambda.
Thanks!