tomtomgo / phearjs
PhearJS - render dynamic Javascript webpages to JSON with PhantomJS
Home Page: http://phear.io
It works perfectly, but only for a short time. After a few hours it stops listening on port 8100. Could you please tell me how I can fix this in the config file? I need a service that runs without stopping. Thanks a lot to the author.
phearjs is effectively incompatible with Windows because of the memcached dependency. It may be desirable to have a config flag that removes the memcached dependency, so that Windows users who want the scraping features without caching can run phearjs.
P.S. I can attempt to make a PR for this, but it might take 1-2 weeks to find a good time for it. Let me know if I should.
First of all, I want to say to users who want to parse web sites: use https://github.com/Tomtomgo/phearjs. It's really great compared with wget, xidel, and even plain phantomjs.
Of course phearjs uses phantomjs, just as phantomjs uses the Linux core )
But in production, when you need to get information from more than 1M web pages, phearjs is the best.
Thanks to the author.
Now, one last question: how do I turn off image downloading, as in phantomjs?
(phantomjs --load-images=false )
I put phantomjs, memcached, and nodejs in a Debian jessie box and installed phearjs as described.
root@6c54acf2dec0:/tmp/phearjs# node phear.js
2017-06-02 12:51:14 [ phear:8100] Starting Phear...
2017-06-02 12:51:14 [ phear:8100] ==================================
2017-06-02 12:51:14 [ phear:8100] Version: 0.6.2
2017-06-02 12:51:14 [ phear:8100] Mode: development
2017-06-02 12:51:14 [ phear:8100] Config file: ./config/config.json
2017-06-02 12:51:14 [ phear:8100] Port: 8100
2017-06-02 12:51:14 [ phear:8100] Workers: 4
2017-06-02 12:51:14 [ phear:8100] ==================================
2017-06-02 12:51:14 [ phear:8100] Worker 1 of 4 started.
2017-06-02 12:51:14 [ phear:8100] Worker 2 of 4 started.
2017-06-02 12:51:14 [ phear:8100] Worker 3 of 4 started.
2017-06-02 12:51:14 [ phear:8100] Worker 4 of 4 started.
2017-06-02 12:51:14 [ phear:8100] Phear started.
2017-06-02 12:51:14 [ phear:8100] Memcache failed: Error: connect ECONNREFUSED 127.0.0.1:11211
2017-06-02 12:51:14 [ phear:8100] Trying to kill process and 4 workers gently...
Terminated
Who starts memcached? Am I supposed to do that? I thought phear would do it. Or is the issue something else?
I switched to Docker because I am on boot2docker, so there is no way to install phantomjs here; hence I cannot test as described. It should work inside my Docker container, though.
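The log above shows Phear being refused a connection on 127.0.0.1:11211 right after startup. A minimal sketch of a preflight check that could be run before `node phear.js` to see whether anything is listening there; `isPortOpen` is a hypothetical helper, not part of PhearJS, and the host and port are taken from the log line:

```javascript
var net = require("net");

// Hypothetical helper (not part of PhearJS): resolves to true if a TCP
// connection to host:port succeeds within timeoutMs, false otherwise.
function isPortOpen(host, port, timeoutMs) {
  return new Promise(function (resolve) {
    var socket = net.connect({ host: host, port: port });
    var timer = setTimeout(function () { finish(false); }, timeoutMs || 1000);

    function finish(open) {
      clearTimeout(timer);
      socket.destroy();
      resolve(open);
    }

    socket.once("connect", function () { finish(true); });
    socket.once("error", function () { finish(false); });
  });
}

// 11211 is the port from the "Memcache failed" log line above.
isPortOpen("127.0.0.1", 11211, 1000).then(function (open) {
  console.log(open
    ? "Something is listening on 127.0.0.1:11211"
    : "Nothing is listening on 127.0.0.1:11211");
});
```

If this reports that nothing is listening, the refusal comes from the absence of a running memcached process rather than from Phear's configuration.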
What I have here:
root@sv2:~/phearjs# cat /etc/issue
Debian GNU/Linux 8 \n \l
root@sv2:~/phearjs# cat /etc/debian_version
8.4
root@sv2:~/phearjs# uname -m
x86_64
root@sv2:~/phearjs# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 8.4 (jessie)
Release: 8.4
Codename: jessie
root@sv2:~/phearjs# nodejs -v
v0.10.29
root@sv2:~/phearjs# npm -v
1.4.21
root@sv2:~/phearjs# memcached -help | grep memcached
memcached 1.4.21
root@sv2:~/phearjs# ls
assets build CHANGELOG.md config gulpfile.js INSTALLATION.md lib node_modules npm-debug.log package.json phear.js README.md restart_worker.sh src
How I am trying to install:
root@sv2:~/phearjs# npm install
and got
npm WARN package.json [email protected] No README data
npm WARN deprecated [email protected]: Jade has been renamed to pug, please install the latest version of pug instead of jade
\
> [email protected] install /root/phearjs/node_modules/usage
> node-gyp rebuild
make: Entering directory '/root/phearjs/node_modules/usage/build'
CXX(target) Release/obj.target/sysinfo/src/binding.o
In file included from ../src/binding.cpp:2:0:
../node_modules/nan/nan.h:324:47: error: ‘REPLACE_INVALID_UTF8’ is not a member of ‘v8::String’
static const unsigned kReplaceInvalidUtf8 = v8::String::REPLACE_INVALID_UTF8;
^
sysinfo.target.mk:88: recipe for target 'Release/obj.target/sysinfo/src/binding.o' failed
make: *** [Release/obj.target/sysinfo/src/binding.o] Error 1
make: Leaving directory '/root/phearjs/node_modules/usage/build'
gyp ERR! build error
gyp ERR! stack Error: `make` failed with exit code: 2
gyp ERR! stack at ChildProcess.onExit (/usr/share/node-gyp/lib/build.js:267:23)
gyp ERR! stack at ChildProcess.emit (events.js:98:17)
gyp ERR! stack at Process.ChildProcess._handle.onexit (child_process.js:809:12)
gyp ERR! System Linux 3.16.0-4-amd64
gyp ERR! command "nodejs" "/usr/bin/node-gyp" "rebuild"
gyp ERR! cwd /root/phearjs/node_modules/usage
gyp ERR! node -v v0.10.29
gyp ERR! node-gyp -v v0.12.2
gyp ERR! not ok
npm WARN This failure might be due to the use of legacy binary "node"
npm WARN For further explanations, please read
/usr/share/doc/nodejs/README.Debian
npm ERR! [email protected] install: `node-gyp rebuild`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] install script.
npm ERR! This is most likely a problem with the usage package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-gyp rebuild
npm ERR! You can get their info via:
npm ERR! npm owner ls usage
npm ERR! There is likely additional logging output above.
npm ERR! System Linux 3.16.0-4-amd64
npm ERR! command "/usr/bin/nodejs" "/usr/bin/npm" "install"
npm ERR! cwd /root/phearjs
npm ERR! node -v v0.10.29
npm ERR! npm -v 1.4.21
npm ERR! code ELIFECYCLE
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /root/phearjs/npm-debug.log
npm ERR! not ok code 0
Please help me to solve this issue. Thanks
I tried to use the headers parameter to send cookies, but encountered errors.
I booted two VMware machines: phearjs runs on Arch Linux (192.168.190.133), hosted via supervisor and nginx, and a simple LAMP stack that I set up on Xubuntu (192.168.190.128) simulates the server I want to visit.
My env:
$ uname -a
Linux localhost 4.8.4-1-ARCH #1 SMP PREEMPT Sat Oct 22 18:26:57 CEST 2016 x86_64 GNU/Linux // archlinux
$ git log --pretty=oneline | head
52f17085994d1373d2247e19b622c9add8e49bf5 Use Pug instead of Jade
$ node -v
v6.9.1
$ phantomjs -v
2.1.1
phearjs started up properly and passed the example in README.md.
I pasted the code encodeURIComponent(JSON.stringify({a: 1})) into the Chrome JS console, and it returned:
"%7B%22a%22%3A1%7D"
So I concatenated http://192.168.190.133/?fetch_url=http%3A%2F%2F192.168.190.128%2Fcookie-test.php&headers= with %7B%22a%22%3A1%7D:
http://192.168.190.133/?fetch_url=http%3A%2F%2F192.168.190.128%2Fcookie-test.php&headers=%7B%22a%22%3A1%7D
When I visit the URL above, I get:
{"success":false,"reason":"Malformed request headers."}
I noticed that %22 decodes to a double quote ("), so I tried removing the double quotes, using %7Ba%3A1%7D and visiting the URL below:
http://192.168.190.133/?fetch_url=http%3A%2F%2F192.168.190.128%2Fcookie-test.php&headers=%7Ba%3A1%7D
And I get:
{"success":false,"reason":"Additional headers not properly formatted, e.g.: encodeURIComponent('{extra: \"Yes.\"}')."}
I guess I misunderstood how the headers parameter is used.
I would appreciate any help; it would be great if anyone could give me an example illustrating how to use headers.
Thanks in advance!
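For reference, the URL construction described in this report can be sketched as below. `buildFetchUrl` is a hypothetical helper, not part of PhearJS; it only reproduces the client-side encoding and confirms that the headers parameter decodes back to valid JSON. It does not explain why the server answered "Malformed request headers."

```javascript
// Hypothetical helper (not part of PhearJS): builds a request URL with the
// fetch_url and headers query parameters used in the report above.
function buildFetchUrl(phearBase, targetUrl, headers) {
  var query = "fetch_url=" + encodeURIComponent(targetUrl);
  if (headers) {
    // Headers are serialized to JSON first, then percent-encoded.
    query += "&headers=" + encodeURIComponent(JSON.stringify(headers));
  }
  return phearBase.replace(/\/+$/, "") + "/?" + query;
}

var url = buildFetchUrl(
  "http://192.168.190.133",
  "http://192.168.190.128/cookie-test.php",
  { a: 1 }
);
console.log(url);

// Round trip: decoding the headers parameter yields the original object,
// so the client-side encoding itself is well-formed.
var encodedHeaders = url.split("&headers=")[1];
console.log(JSON.parse(decodeURIComponent(encodedHeaders)));
```

The first line printed is identical to the first URL tried in the report, which suggests the rejection happens on the server side rather than in the encoding step.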
I found a few problems that are very important for me. Please help me.
In the phearjs log file I have:
2016-07-08 00:04:52 [ worker:p-1:9052] Fetching http://pelikan18.ru.
2016-07-08 00:05:03 [ worker:p-1:9052] Redirected to http://pelikan18.ru/
2016-07-08 00:05:03 [ worker:p-1:9052] ResourceError on http://pelikan18.ru: Error downloading http://pelikan18.ru/ - server replied: Internal Server Error (http://pelikan18.ru/)
2016-07-08 00:05:03 [ worker:p-1:9052] Ending subprocess.
2016-07-08 00:05:03 [ worker:p-1:9052] Ended subprocess with status FAILED TO FETCH THIS URL: ERROR DOWNLOADING HTTP://PELIKAN18.RU/ - SERVER REPLIED: INTERNAL SERVER ERROR.
The same happens with sovertec.ru and others.
phearjs can't fetch them with raw=true either [https://github.com/Tomtomgo/phearjs#usage]
With web sites that have no errors, phearjs works perfectly.
Hi Tom,
Each worker can handle a pool of requests; however, there is currently no option or mechanism to monitor the number of queries each worker is handling. Having this sort of information is useful for the following reasons:
Perhaps a good addition to the current API can be http://127.0.0.1:8100/?get_stats=workers
The output should contain:
{
  "workers": {
    "worker_a": {
      "port": "8031",
      "PID": "12512",
      "processing": "2",
      "status": "?",
      "currently_processing": {
        "request_1": {
          "url": "http://www.google.com",
          "time_in_pool": "20s",
          "status": "processing",
          "CPU": "cpu usage as a percentage",
          "MEM": "Ram usage as a percentage"
        },
        "request_2": {
          "url": "http://www.yahoo.com",
          "time_in_pool": "10s",
          "status": "idle",
          "CPU": "cpu usage as a percentage",
          "MEM": "Ram usage as a percentage"
        }
      }
    },
    "worker_b": {
      "port": "8032",
      "PID": "14512",
      "processing": "2",
      "status": "?",
      "currently_processing": {
        "request_1": {
          "url": "http://www.jira.com",
          "time_in_pool": "5s",
          "status": "processing",
          "CPU": "cpu usage as a percentage",
          "MEM": "Ram usage as a percentage"
        },
        "request_2": {
          "url": "http://www.happy.com",
          "time_in_pool": "15s",
          "status": "processing",
          "CPU": "cpu usage as a percentage",
          "MEM": "Ram usage as a percentage"
        }
      }
    }
  }
}
Thanks,
Fouad
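To illustrate how a client might consume output of this shape, here is a minimal sketch. The `get_stats=workers` endpoint is only the proposal made in this issue and is not known to exist in PhearJS; `summarizeWorkers` and the trimmed sample object are hypothetical.

```javascript
// Hypothetical consumer of the proposed stats output: counts workers and
// in-flight requests, and names the worker with the most requests.
function summarizeWorkers(stats) {
  var summary = { workers: 0, in_flight: 0, busiest: null };
  var max = -1;
  Object.keys(stats.workers).forEach(function (name) {
    var worker = stats.workers[name];
    var count = Object.keys(worker.currently_processing || {}).length;
    summary.workers += 1;
    summary.in_flight += count;
    if (count > max) {
      max = count;
      summary.busiest = name;
    }
  });
  return summary;
}

// Trimmed sample in the proposed shape (values are illustrative only).
var sample = {
  workers: {
    worker_a: {
      port: "8031",
      currently_processing: {
        request_1: { url: "http://www.google.com", status: "processing" },
        request_2: { url: "http://www.yahoo.com", status: "idle" }
      }
    },
    worker_b: {
      port: "8032",
      currently_processing: {
        request_1: { url: "http://www.jira.com", status: "processing" }
      }
    }
  }
};

console.log(summarizeWorkers(sample));
```

Counting the keys of currently_processing, rather than trusting the separate "processing" field, keeps the summary consistent even if the two ever disagree.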
I installed phearjs in two ways, both with the current master revision: first system-wide under Ubuntu, then via a nodejs Docker container (+memcached).
Both fail when I try to request a web page:
"Ended process with status SERVICE UNAVAILABLE, MAXIMUM NUMBER OF ALLOWED CONNECTIONS REACHED.."
I increased the connections in the config.json file to 100 for both development and production. The status page works, and reports the connection attempts as "refused" rather than as failed.
Any ideas what is going wrong here?
FYI, following the Installation Instructions, node phearjs gives me Error: Cannot find module '/path/to/phearjs/node_modules/phearjs'
Workaround: the development option node phear.js works fine for my needs.
Installing on OSX. These are my steps so far:
It appears to start normally on 8100, ending in "Phear started"
Getting no response from curl:
curl -X GET "http://localhost:8100?fetch_url=http%3A%2F%2Fwww.apple.com"
Getting no response from chrome browser:
http://localhost:8100/?fetch_url=http%3A%2F%2Fwww.apple.com
Any ideas?
Hi,
I need to get a list of all the decoded_url values that are created in phearjs/lib/worker.js, in onResourceReceived:
page_inst.onResourceReceived = function(response) {
  var decoded_url;
  decoded_url = decodeURIComponent(response.url);
  if (decoded_url === final_url && response.stage === "end") {
    return headers[decoded_url] = response.headers;
  }
};
Is there a way to get this list and store it together with the HTML in the memcached instance that is already in place?
Thanks
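One possible way to collect every resource URL is sketched below. This is not the project's implementation: `attachResourceCollector` is a hypothetical helper, the page object is a stand-in so the snippet runs under Node, and storing the list in memcached alongside the HTML is left out. It relies on PhantomJS invoking onResourceReceived with a `stage` of "start" and "end" for each resource, as in the snippet above.

```javascript
// Hypothetical helper: records the decoded URL of every resource whose
// download has finished, instead of only the final page URL.
function attachResourceCollector(page_inst) {
  var received_urls = [];
  page_inst.onResourceReceived = function (response) {
    if (response.stage !== "end") {
      return; // skip the "start" event so each resource is recorded once
    }
    var decoded_url;
    try {
      decoded_url = decodeURIComponent(response.url);
    } catch (e) {
      decoded_url = response.url; // malformed escape sequence: keep raw URL
    }
    received_urls.push(decoded_url);
  };
  return received_urls;
}

// Stand-in for a PhantomJS page object, used only to exercise the handler.
var fake_page = {};
var urls = attachResourceCollector(fake_page);
fake_page.onResourceReceived({ url: "http://example.com/a%20b.js", stage: "start" });
fake_page.onResourceReceived({ url: "http://example.com/a%20b.js", stage: "end" });
fake_page.onResourceReceived({ url: "http://example.com/style.css", stage: "end" });
console.log(urls);
```

The sketch uses `var` and plain functions throughout so the handler could also run inside PhantomJS, which does not support newer JavaScript syntax.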
Hello there,
I tried to install on Windows, but it failed. Of course I don't have memcached, but I could install it through the Bitnami stack: https://bitnami.com/stack/memcached
Anyway, I am not interested in caching right now, so:
1- I have Node.js installed
2- I have the latest version of PhantomJS
3- No memcached yet
4- Followed these steps:
git clone https://github.com/Tomtomgo/phearjs.git
cd phearjs
npm install
The installation produces an error; the debug log is attached.
Now, starting the service with node phearjs, I get this error:
module.js:471
throw err;
^
Error: Cannot find module 'C:\Bitnami\phearjs\phearjs'
at Function.Module._resolveFilename (module.js:469:15)
at Function.Module._load (module.js:417:25)
at Module.runMain (module.js:604:10)
at run (bootstrap_node.js:394:7)
at startup (bootstrap_node.js:149:9)
at bootstrap_node.js:509:3
I created phearb, a Ruby gem for connecting to a PhearJS server very easily!
I just thought you should know, in case you want to reference it in the README or just check it out!
It is at an early stage, yet usable. As I need more features in my app, I'll be adding them to the gem.
Thanks
As follows:
The request should be http://localhost:8100?fetch_url=domain.com&render=jpg
Response:
{
  "success": true,
  "input_url": "http://such-website.com",
  "final_url": "http://www.such-website.com/",
  "request_headers": {},
  "response_headers": {
    "date": "Sun, 08 Feb 2015 15:11:22 GMT",
    "content-encoding": "gzip",
    "expires": "Sun, 08 Feb 2015 15:12:33 GMT",
    "vary": "Accept-Encoding",
    "cache-control": "max-age=60",
    "last-modified": "Sun, 08 Feb 2015 15:11:33 GMT",
    "content-type": "text/html; charset=utf-8"
  },
  "had_js_errors": false,
  "content": "<rendered HTML>",
  "rendered": "path/to/img.jpg"
}
There should also be a config option for the path where the images are stored.
Hello,
Is it possible to use PhearJS with .NET and Visual Studio?
Thank you!
It might be a nice addition to have the option to return images generated by phantomjs as Base64.
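As an illustration of the idea, a Node-side sketch of turning image bytes into a Base64 data URI follows. `imageToDataUri` is a hypothetical helper, not part of PhearJS, and the four bytes are only a stand-in for a real rendered file. PhantomJS's web page module also documents a `renderBase64(format)` method, which might make a file round trip unnecessary, though that is not exercised here.

```javascript
// Hypothetical helper (not part of PhearJS): wraps raw image bytes in a
// Base64 data URI that could be embedded in a JSON response.
function imageToDataUri(imageBytes, mimeType) {
  return "data:" + mimeType + ";base64," + imageBytes.toString("base64");
}

// Stand-in bytes: the first four bytes of a typical JPEG file. In practice
// these would be the bytes of the rendered image, for example as read
// with fs.readFileSync.
var fakeJpeg = Buffer.from([0xff, 0xd8, 0xff, 0xe0]);
console.log(imageToDataUri(fakeJpeg, "image/jpeg"));
```

Base64 inflates the payload by roughly a third, so returning a path (as in the render=jpg proposal) and returning inline data are a trade-off between response size and convenience.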
Hi, amazing tool you've created!
Prerender.io uses meta tags to signal 404s and 301s to crawlers. https://prerender.io/documentation/best-practices
Does phear have similar functionality or do you advise letting the server handle this directly?
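If the server is left to handle this, one option is to read the Prerender-style meta tags out of the rendered HTML that PhearJS returns in its `content` field. The sketch below is an assumption-laden illustration: `readPrerenderHints` is a hypothetical helper, PhearJS is not known to interpret these tags itself, and the regular expressions assume the `name` attribute comes before `content`.

```javascript
// Hypothetical helper: extracts Prerender-style hints from rendered HTML.
// Assumes attribute order name="..." then content="...".
function readPrerenderHints(html) {
  var hints = { status: null, location: null };

  var status = /<meta\s+name=["']prerender-status-code["']\s+content=["'](\d{3})["']/i.exec(html);
  if (status) {
    hints.status = parseInt(status[1], 10);
  }

  var header = /<meta\s+name=["']prerender-header["']\s+content=["']Location:\s*([^"']+)["']/i.exec(html);
  if (header) {
    hints.location = header[1];
  }

  return hints;
}

// Illustrative rendered HTML, standing in for the "content" field.
var content =
  '<html><head><meta name="prerender-status-code" content="301">' +
  '<meta name="prerender-header" content="Location: http://www.example.com/new">' +
  '</head></html>';

console.log(readPrerenderHints(content));
```

The calling server could then answer the crawler with the extracted status code and Location header instead of a plain 200 response.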