phearjs's Issues

Works perfectly, but only for a short time

It works perfectly, but only for a short time. After a few hours it stops listening on port 8100. Could you please tell me how to fix this in the config file? I need a service that runs without stopping. Many thanks to the author.

Windows support

phearjs is effectively incompatible with Windows because of the memcached dependency. It may be desirable to have a config flag that removes the memcached dependency, so that Windows users who want the scraping features without caching can still run phearjs.

P.S I can attempt to make a PR for this but it might take 1-2 weeks to find a good time for it. Let me know if I should.

Many thanks for a great tool

First, I want to say to anyone who needs to parse web sites: use https://github.com/Tomtomgo/phearjs. It is a great alternative to wget, xidel and even plain phantomjs.
Of course phearjs uses phantomjs, just as phantomjs builds on the Linux core )
But in production, when you need to collect information from more than 1M web pages, phearjs is the best.
Thanks to the author.


Now, one last question: how do I turn off image downloading, as in phantomjs?
(phantomjs --load-images=false )

inside docker, memcache not started

I put phantomjs, memcached, nodejs in a jessie box and installed as described.

root@6c54acf2dec0:/tmp/phearjs# node phear.js
2017-06-02 12:51:14 [          phear:8100] Starting Phear...
2017-06-02 12:51:14 [          phear:8100] ==================================
2017-06-02 12:51:14 [          phear:8100] Version: 0.6.2
2017-06-02 12:51:14 [          phear:8100] Mode: development
2017-06-02 12:51:14 [          phear:8100] Config file: ./config/config.json
2017-06-02 12:51:14 [          phear:8100] Port: 8100
2017-06-02 12:51:14 [          phear:8100] Workers: 4
2017-06-02 12:51:14 [          phear:8100] ==================================
2017-06-02 12:51:14 [          phear:8100] Worker 1 of 4 started.
2017-06-02 12:51:14 [          phear:8100] Worker 2 of 4 started.
2017-06-02 12:51:14 [          phear:8100] Worker 3 of 4 started.
2017-06-02 12:51:14 [          phear:8100] Worker 4 of 4 started.
2017-06-02 12:51:14 [          phear:8100] Phear started.
2017-06-02 12:51:14 [          phear:8100] Memcache failed: Error: connect ECONNREFUSED 127.0.0.1:11211
2017-06-02 12:51:14 [          phear:8100] Trying to kill process and 4 workers gently...
Terminated

Who starts memcached? Am I to do that? I thought phear would do it. Or what is the issue here?

I switched to docker because I am on boot2docker, so there is no way to install phantomjs here, hence I cannot test as described. It should work inside my docker container, though.
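The log above shows Phear exiting right after a refused connection to 127.0.0.1:11211, which suggests it expects memcached to be running already rather than starting it itself. A minimal sketch, assuming that reading is correct, of a pre-flight check using Node's built-in net module; the helper name isPortOpen is hypothetical and not part of phearjs:

```javascript
var net = require('net');

// Hypothetical helper (not part of phearjs): resolves to true if a TCP
// connection to host:port succeeds within timeoutMs, false otherwise.
function isPortOpen(host, port, timeoutMs) {
  return new Promise(function (resolve) {
    var socket = net.connect({ host: host, port: port });
    var finish = function (result) {
      socket.destroy();
      resolve(result);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', function () { finish(true); });
    socket.once('timeout', function () { finish(false); });
    socket.once('error', function () { finish(false); });
  });
}

// Usage (address and port taken from the log above):
// isPortOpen('127.0.0.1', 11211, 1000).then(function (open) {
//   console.log(open ? 'memcached is reachable' : 'start memcached first');
// });
```

If the check fails inside the container, starting memcached before node phear.js (for example with memcached -d -p 11211, plus -u to pick a user when running as root) may be all that is missing.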

Trouble when trying to install phearjs

Please help me. Many thanks.

What I have here:

root@sv2:~/phearjs# cat /etc/issue
Debian GNU/Linux 8 \n \l
root@sv2:~/phearjs# cat /etc/debian_version
8.4
root@sv2:~/phearjs# uname -m
x86_64
root@sv2:~/phearjs# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.4 (jessie)
Release:        8.4
Codename:       jessie
root@sv2:~/phearjs# nodejs -v
v0.10.29
root@sv2:~/phearjs# npm -v
1.4.21
root@sv2:~/phearjs# memcached -help | grep memcached
memcached 1.4.21
root@sv2:~/phearjs# ls
assets  build  CHANGELOG.md  config  gulpfile.js  INSTALLATION.md  lib  node_modules  npm-debug.log  package.json  phear.js  README.md  restart_worker.sh  src

How I am trying to install:
root@sv2:~/phearjs# npm install
and this is what I got:

npm WARN package.json [email protected] No README data
npm WARN deprecated [email protected]: Jade has been renamed to pug, please install the latest version of pug instead of jade
> [email protected] install /root/phearjs/node_modules/usage
> node-gyp rebuild

make: Entering directory '/root/phearjs/node_modules/usage/build'
  CXX(target) Release/obj.target/sysinfo/src/binding.o
In file included from ../src/binding.cpp:2:0:
../node_modules/nan/nan.h:324:47: error: ‘REPLACE_INVALID_UTF8’ is not a member of ‘v8::String’
   static const unsigned kReplaceInvalidUtf8 = v8::String::REPLACE_INVALID_UTF8;
                                               ^
sysinfo.target.mk:88: recipe for target 'Release/obj.target/sysinfo/src/binding.o' failed
make: *** [Release/obj.target/sysinfo/src/binding.o] Error 1
make: Leaving directory '/root/phearjs/node_modules/usage/build'
gyp ERR! build error 
gyp ERR! stack Error: `make` failed with exit code: 2
gyp ERR! stack     at ChildProcess.onExit (/usr/share/node-gyp/lib/build.js:267:23)
gyp ERR! stack     at ChildProcess.emit (events.js:98:17)
gyp ERR! stack     at Process.ChildProcess._handle.onexit (child_process.js:809:12)
gyp ERR! System Linux 3.16.0-4-amd64
gyp ERR! command "nodejs" "/usr/bin/node-gyp" "rebuild"
gyp ERR! cwd /root/phearjs/node_modules/usage
gyp ERR! node -v v0.10.29
gyp ERR! node-gyp -v v0.12.2
gyp ERR! not ok 
npm WARN This failure might be due to the use of legacy binary "node"
npm WARN For further explanations, please read
/usr/share/doc/nodejs/README.Debian

npm ERR! [email protected] install: `node-gyp rebuild`
npm ERR! Exit status 1
npm ERR! 
npm ERR! Failed at the [email protected] install script.
npm ERR! This is most likely a problem with the usage package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     node-gyp rebuild
npm ERR! You can get their info via:
npm ERR!     npm owner ls usage
npm ERR! There is likely additional logging output above.

npm ERR! System Linux 3.16.0-4-amd64
npm ERR! command "/usr/bin/nodejs" "/usr/bin/npm" "install"
npm ERR! cwd /root/phearjs
npm ERR! node -v v0.10.29
npm ERR! npm -v 1.4.21
npm ERR! code ELIFECYCLE
npm ERR! 
npm ERR! Additional logging details can be found in:
npm ERR!     /root/phearjs/npm-debug.log
npm ERR! not ok code 0

Please help me to solve this issue. Thanks

Cannot handle headers properly

I tried to use headers to send cookies, but ran into errors.

I booted two VMware machines: phearjs runs on Arch Linux (192.168.190.133), hosted via supervisor and nginx, and a simple LAMP stack on Xubuntu (192.168.190.128) simulates the target server I want to visit.

My env:

$ uname -a
Linux localhost 4.8.4-1-ARCH #1 SMP PREEMPT Sat Oct 22 18:26:57 CEST 2016 x86_64 GNU/Linux // archlinux
$ git log --pretty=oneline | head
52f17085994d1373d2247e19b622c9add8e49bf5 Use Pug instead of Jade
$ node -v
v6.9.1
$ phantomjs -v
2.1.1

phearjs started up properly and passed the example from README.md.

I pasted the code encodeURIComponent(JSON.stringify({a: 1})) into the Chrome JS console, which printed:

"%7B%22a%22%3A1%7D"

So I concatenated http://192.168.190.133/?fetch_url=http%3A%2F%2F192.168.190.128%2Fcookie-test.php&headers= with %7B%22a%22%3A1%7D:

http://192.168.190.133/?fetch_url=http%3A%2F%2F192.168.190.128%2Fcookie-test.php&headers=%7B%22a%22%3A1%7D

Visiting the URL above returns:

{"success":false,"reason":"Malformed request headers."}

I noticed that %22 is decoded to a double quote ", so I tried removing the double quotes, using %7Ba%3A1%7D and visiting the URL below:

http://192.168.190.133/?fetch_url=http%3A%2F%2F192.168.190.128%2Fcookie-test.php&headers=%7Ba%3A1%7D

This time I got:

{"success":false,"reason":"Additional headers not properly formatted, e.g.: encodeURIComponent('{extra: \"Yes.\"}')."}

I guess I misunderstood how the headers parameter is meant to be used.

I would appreciate any help; an example illustrating how to use headers would be great.

Thanks in advance!
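For reference, the encoding steps described above can be sketched as a small helper. This only reproduces the client-side URL construction from this issue; it does not establish which header format the server accepts, and the function name buildFetchUrl is hypothetical:

```javascript
// Hypothetical helper: builds a Phear request URL from a base address,
// a target URL and an optional headers object, using the same
// encodeURIComponent(JSON.stringify(...)) step described in the issue.
function buildFetchUrl(phearBase, targetUrl, headers) {
  var url = phearBase + '?fetch_url=' + encodeURIComponent(targetUrl);
  if (headers) {
    url += '&headers=' + encodeURIComponent(JSON.stringify(headers));
  }
  return url;
}

// Addresses taken from the issue text.
var requestUrl = buildFetchUrl(
  'http://192.168.190.133/',
  'http://192.168.190.128/cookie-test.php',
  { a: 1 }
);
console.log(requestUrl);
```

Building the URL programmatically at least rules out manual encoding mistakes, so a remaining "Malformed request headers." response would point at the server-side handling of the parameter.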

Can't parse a page normally if the web site has errors

I found a few problems that are very important to me. Please help me.
The phearjs log file shows:

2016-07-08 00:04:52 [     worker:p-1:9052] Fetching http://pelikan18.ru.
2016-07-08 00:05:03 [     worker:p-1:9052] Redirected to http://pelikan18.ru/
2016-07-08 00:05:03 [     worker:p-1:9052] ResourceError on http://pelikan18.ru: Error downloading http://pelikan18.ru/ - server replied: Internal Server Error (http://pelikan18.ru/)
2016-07-08 00:05:03 [     worker:p-1:9052] Ending subprocess.
2016-07-08 00:05:03 [     worker:p-1:9052] Ended subprocess with status FAILED TO FETCH THIS URL: ERROR DOWNLOADING HTTP://PELIKAN18.RU/ - SERVER REPLIED: INTERNAL SERVER ERROR.

The same happens with sovertec.ru and others.
phearjs can't fetch these pages even with raw=true [https://github.com/Tomtomgo/phearjs#usage]
For web sites without errors phearjs works perfectly.

Monitor individual worker capacity

Hi Tom,

Each worker can handle a pool of requests, but there is currently no option or mechanism to monitor the number of queries each worker is handling. Having this information would be useful for the following reasons:

  1. It will help determine how efficient the Phear system is against a given set of concurrent, continuous request streams.
  2. It will provide a more granular set of information.
  3. The information can be used by an external program, such as a load balancer, to moderate the traffic going to Phear.
  4. It could help determine if there are any idle or zombie requests that are stuck in the pool.

Perhaps a good addition to the current API can be http://127.0.0.1:8100/?get_stats=workers

The output should contain:

  1. Worker information [name, port, current capacity]
  2. Current URLs being processed per worker.
  3. Time each URL has been in the pool.
  4. All the above can be placed in JSON format, as an example:
{
    "workers": {
        "worker_a": {
            "port": "8031",
            "PID": "12512",
            "processing": "2",
            "status": "?",
            "currently_processing": {
                "request_1": {
                    "url": "http://www.google.com",
                    "time_in_pool": "20s",
                    "status": "processing",
                    "CPU": "cpu usage as a percentage",
                    "MEM": "Ram usage as a percentage"
                },
                "request_2": {
                    "url": "http://www.yahoo.com",
                    "time_in_pool": "10s",
                    "status": "idle",
                    "CPU": "cpu usage as a percentage",
                    "MEM": "Ram usage as a percentage"
                }
            }
        },
        "worker_b": {
            "port": "8032",
            "PID": "14512",
            "processing": "2",
            "status": "?",
            "currently_processing": {
                "request_1": {
                    "url": "http://www.jira.com",
                    "time_in_pool": "5s",
                    "status": "processing",
                    "CPU": "cpu usage as a percentage",
                    "MEM": "Ram usage as a percentage"
                },
                "request_2": {
                    "url": "http://www.happy.com",
                    "time_in_pool": "15s",
                    "status": "processing",
                    "CPU": "cpu usage as a percentage",
                    "MEM": "Ram usage as a percentage"
                }
            }
        }
    }
}

Thanks,
Fouad
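To illustrate how an external program might consume such output, here is a minimal sketch that scans the proposed structure for requests that have stayed in the pool longer than a threshold. The get_stats=workers endpoint and its JSON shape are only the proposal from this issue, not an existing Phear API, and the function names are hypothetical:

```javascript
// Parse a duration such as "20s" into seconds; NaN for any other shape.
function parseSeconds(text) {
  var match = /^(\d+)s$/.exec(String(text));
  return match ? Number(match[1]) : NaN;
}

// Return every request whose time_in_pool is at least thresholdSeconds.
// The input follows the JSON shape proposed in this issue.
function findSlowRequests(stats, thresholdSeconds) {
  var slow = [];
  var workers = stats.workers || {};
  Object.keys(workers).forEach(function (workerName) {
    var requests = workers[workerName].currently_processing || {};
    Object.keys(requests).forEach(function (requestName) {
      var request = requests[requestName];
      if (parseSeconds(request.time_in_pool) >= thresholdSeconds) {
        slow.push({
          worker: workerName,
          url: request.url,
          status: request.status
        });
      }
    });
  });
  return slow;
}
```

A load balancer or watchdog could call findSlowRequests on each poll and stop routing traffic to workers that keep reporting stuck requests.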

SERVICE UNAVAILABLE

I installed phearjs in two ways, both with the current master revision: first system-wide under Ubuntu, then via a nodejs docker container (+memcached).

Both fail when I try to request a web page:

"Ended process with status SERVICE UNAVAILABLE, MAXIMUM NUMBER OF ALLOWED CONNECTIONS REACHED.."

I increased the connections in the config.json file to 100 for both development and production. The status page works, and reports the connection attempts as "refused" rather than failed.

Any ideas what is going wrong here?

Error on run

FYI, following the Installation Instructions, node phearjs gives me Error: Cannot find module '/path/to/phearjs/node_modules/phearjs'

Workaround: the development option node phear.js works fine for my needs.

Startup issue

Installing on OSX. These are my steps so far:

  1. Node Install from .pkg
  2. npm install phantomjs
  3. I used brew to install memcached: brew install memcached
  4. git clone https://github.com/Tomtomgo/phearjs.git
  5. cd phearjs
  6. npm install
  7. memcached -d -p 11211
  8. node phear.js

It appears to start normally on 8100, ending in "Phear started"

Getting no response from curl:
curl -X GET "http://localhost:8100?fetch_url=http%3A%2F%2Fwww.apple.com"

Getting no response from chrome browser:
http://localhost:8100/?fetch_url=http%3A%2F%2Fwww.apple.com

Any ideas?

Retrieve onResourceReceived list of urls

Hi,

I need to get a list of every decoded_url that is created in the onResourceReceived handler in phearjs/lib/worker.js:

page_inst.onResourceReceived = function(response) {
  var decoded_url;
  decoded_url = decodeURIComponent(response.url);
  if (decoded_url === final_url && response.stage === "end") {
    return headers[decoded_url] = response.headers;
  }
};

Is there a way to get this list and store it together with the HTML in memcached, which is already in place?

Thanks
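One possible direction, as a sketch only: wrap the handler so that it records every resource URL in addition to the headers of the final URL. The factory createResourceCollector is hypothetical and not part of phearjs; how the resulting list would be passed on and cached alongside the HTML is left open here. The code sticks to ES5 so it can also run inside PhantomJS:

```javascript
// Hypothetical factory: returns a handler compatible with PhantomJS's
// page.onResourceReceived plus the data it collects.
function createResourceCollector(finalUrl) {
  var headers = {};
  var urls = [];

  function onResourceReceived(response) {
    var decodedUrl = decodeURIComponent(response.url);
    // PhantomJS reports each resource at stage "start" and "end";
    // only count it once it has finished.
    if (response.stage !== 'end') {
      return;
    }
    if (urls.indexOf(decodedUrl) === -1) {
      urls.push(decodedUrl);
    }
    // Same behaviour as the original handler for the final URL.
    if (decodedUrl === finalUrl) {
      headers[decodedUrl] = response.headers;
    }
  }

  return { onResourceReceived: onResourceReceived, urls: urls, headers: headers };
}

// Usage inside the worker (page_inst and final_url as in the snippet above):
// var collector = createResourceCollector(final_url);
// page_inst.onResourceReceived = collector.onResourceReceived;
```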

Not able to install on Windows 10 x64

Hello there,

I tried to install on Windows, but failed. Of course I don't have memcached, though I could install it through the Bitnami stack: https://bitnami.com/stack/memcached

Anyway, I am not interested in caching for now, so:
1- I have Node.js installed
2- I have the latest version of PhantomJS
3- No memcached yet
4- I followed these steps:

git clone https://github.com/Tomtomgo/phearjs.git
cd phearjs
npm install

The installation fails with an error; the debug log is attached.

npm-debug.txt

Starting the service with node phearjs then gives this error:

module.js:471
    throw err;
    ^

Error: Cannot find module 'C:\Bitnami\phearjs\phearjs'
    at Function.Module._resolveFilename (module.js:469:15)
    at Function.Module._load (module.js:417:25)
    at Module.runMain (module.js:604:10)
    at run (bootstrap_node.js:394:7)
    at startup (bootstrap_node.js:149:9)
    at bootstrap_node.js:509:3

Rubygem created

I created phearb, a Ruby gem for connecting to a phearjs server very easily!

I just thought you should know, in case you want to reference it in the README or simply check it out!

It is at an early stage, yet usable. As I need more features in my app, I'm going to add them to the gem.

Thanks

Add render page feature

As follows:

The request should be http://localhost:8100?fetch_url=domain.com&render=jpg

Response:

{
  "success": true,
  "input_url": "http://such-website.com",
  "final_url": "http://www.such-website.com/",
  "request_headers": {},
  "response_headers": {
    "date": "Sun, 08 Feb 2015 15:11:22 GMT",
    "content-encoding": "gzip",
    "expires": "Sun, 08 Feb 2015 15:12:33 GMT",
    "vary": "Accept-Encoding",
    "cache-control": "max-age=60",
    "last-modified": "Sun, 08 Feb 2015 15:11:33 GMT",
    "content-type": "text/html; charset=utf-8"
  },
  "had_js_errors": false,
  "content": "<rendered HTML>",
  "rendered": "path/to/img.jpg"
}

There should also be a config option for the path where the images are stored.

phearjs for .Net

Hello,

Is it possible to use phearjs with .NET and Visual Studio?

Thank you!

return images as base64

Might be a nice addition to have the option to return images generated by phantomjs as Base64
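As a rough sketch of what this could look like: PhantomJS pages expose renderBase64(format), which returns the rendered page as a Base64 string. The helper below and the rendered_base64 field name are hypothetical and not part of phearjs; where such a call would fit into the worker is an assumption.

```javascript
// Hypothetical helper: returns a copy of the response object with the
// page rendered as Base64 added. `page` is expected to be a PhantomJS
// web page object; renderBase64 accepts 'PNG', 'GIF' or 'JPEG'.
function attachRenderedImage(page, result, format) {
  var output = {};
  Object.keys(result).forEach(function (key) {
    output[key] = result[key];
  });
  // rendered_base64 is an illustrative field name, not an existing one.
  output.rendered_base64 = page.renderBase64(format || 'PNG');
  return output;
}
```

Returning the image inline avoids having to configure and serve an image directory, at the cost of larger JSON responses.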
