duzun / hquery.php

An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.

Home Page: https://duzun.me/playground/hquery

License: MIT License

hquery.php's Introduction

hQuery.php

An extremely fast and efficient web scraper that can parse megabytes of invalid HTML in a blink of an eye.

You can use the familiar jQuery/CSS selector syntax to easily find the data you need.

In my unit tests, I require it to be at least 10 times faster than Symfony's DOMCrawler on a 3 MB HTML document. In practice, according to my humble tests, it is two to three orders of magnitude faster than DOMCrawler in some cases, especially when selecting thousands of elements, and on average it uses half as much RAM.

See tests/README.md.

API Documentation

💡 Features

  • Very fast parsing and lookup
  • Parses broken HTML
  • jQuery-like style of DOM traversal
  • Low memory usage
  • Can handle big HTML documents (I have tested up to 20 MB, but the limit is the amount of RAM you have)
  • Doesn't require cURL to be installed and automatically handles redirects (see hQuery::fromUrl())
  • Caches response for multiple processing tasks
  • PSR-7 friendly (see hQuery::fromHTML($message))
  • PHP 5.3+
  • No dependencies

🛠 Install

Just add this folder to your project, include_once 'hquery.php';, and you are ready to hQuery.

Alternatively, install it with Composer: composer require duzun/hquery

or with npm: npm install hquery.php, then require_once 'node_modules/hquery.php/hquery.php';.

⚙ Usage

Basic setup:

// Optionally use namespaces
use duzun\hQuery;

// Either use composer, or include this file:
include_once '/path/to/libs/hquery.php';

// Set the cache path - must be a writable folder
// If not set, hQuery::fromURL() would make a new request on each call
hQuery::$cache_path = "/path/to/cache";

// Time to keep request data in cache, seconds
// A value of 0 disables cache
hQuery::$cache_expires = 3600; // default one hour

I would recommend using php-http/cache-plugin with a PSR-7 client for better flexibility.

Load HTML from a file

hQuery::fromFile( string $filename, boolean $use_include_path = false, resource $context = NULL )
// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');

// Remote
$doc = hQuery::fromFile('https://example.com/', false, $context);

Where $context is created with stream_context_create().

For an example of using $context to make a HTTP request with proxy see #26.
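
As a rough sketch of that approach, a proxy can be configured through PHP's own `http` stream-context options (`proxy`, `request_fulluri`, `timeout`). The helper name `make_proxy_context()` and the proxy address below are placeholders assumed for illustration; they are not part of hQuery.

```php
<?php
// Hypothetical helper (not part of hQuery): build a stream context
// that routes HTTP requests through a proxy.
function make_proxy_context($proxy, $timeout = 10)
{
    return stream_context_create(array(
        'http' => array(
            'proxy'           => $proxy,   // e.g. 'tcp://127.0.0.1:8080' (placeholder address)
            'request_fulluri' => true,     // many proxies expect the full URI in the request line
            'timeout'         => $timeout, // read timeout, in seconds
        ),
    ));
}

$context = make_proxy_context('tcp://127.0.0.1:8080');

// With hQuery loaded, the context goes in as the third argument:
// $doc = hQuery::fromFile('https://example.com/', false, $context);
```

Keeping the context in a small helper makes it easy to reuse the same proxy settings across several fromFile() calls.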

Load HTML from a string

hQuery::fromHTML( string $html, string $url = NULL )
$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

// Set base_url, in case the document is loaded from local source.
// Note: The base_url property is used to retrieve absolute URLs from relative ones.
$doc->base_url = 'http://desired-host.net/path';

Load a remote HTML document

hQuery::fromUrl( string $url, array $headers = NULL, array|string $body = NULL, array $options = NULL )
use duzun\hQuery;

// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);

var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request

// with POST
$doc = hQuery::fromUrl(
    'http://example.com/someDoc.html', // url
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
    ['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
    ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);

For building advanced requests (POST, parameters, etc.) see hQuery::http_wr(), though I recommend using a specialized (PSR-7?) library for making requests and hQuery::fromHTML($html, $url=NULL) for processing the results. See Guzzle, for example.

PSR-7 example:

composer require php-http/message php-http/discovery php-http/curl-client

If you don't have cURL PHP extension, just replace php-http/curl-client with php-http/socket-client in the above command.

use duzun\hQuery;

use Http\Discovery\HttpClientDiscovery;
use Http\Discovery\MessageFactoryDiscovery;

$client = HttpClientDiscovery::find();
$messageFactory = MessageFactoryDiscovery::find();

$request = $messageFactory->createRequest(
  'GET',
  'http://example.com/someDoc.html',
  ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);

$response = $client->sendRequest($request);

$doc = hQuery::fromHTML($response, $request->getUri());

Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).

Processing the results

hQuery::find( string $sel, array|string $attr = NULL, hQuery\Node $ctx = NULL )
// Find all banners (images inside anchors)
$banners = $doc->find('a[href] > img[src]:parent');

// Extract links and images
$links  = array();
$images = array();
$titles = array();

// If the result of find() is not empty
// $banners is a collection of elements (hQuery\Element)
if ( $banners ) {

    // Iterate over the result
    foreach($banners as $pos => $a) {
        $links[$pos] = $a->attr('href'); // get absolute URL from href property
        $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text

        // Filter the result
        if ( !$a->hasClass('logo') ) {
            // $a->style property is the parsed $a->attr('style')
            if ( strtolower($a->style['position']) == 'fixed' ) continue;

            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$pos] = $img->src; // short for $img->attr('src')
        }
    }

    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;

        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}

// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;

// Get the size of the document ( strlen($html) )
$size = $doc->size;

🖧 Live Demo

On DUzun.Me

A lot of people ask for sources of my Live Demo page. Here we go:

view-source:https://duzun.me/playground/hquery

🏃 Run the playground

You can easily run any of the examples/ on your local machine. All you need is PHP installed in your system. After you clone the repo with git clone https://github.com/duzun/hQuery.php.git, you have several options to start a web-server.

Option 1:

cd hQuery.php/examples
php -S localhost:8000

# open browser http://localhost:8000/

Option 2 (browser-sync):

This option starts a live-reload server and is good for playing with the code.

npm install
gulp

# open browser http://localhost:8080/

Option 3 (VSCode):

If you are using VSCode, simply open the project and run debugger (F5).

🔧 TODO

  • Unit test everything
  • Document everything
  • Cookie support (implemented in mem for redirects)
  • Improve selectors to be able to select by attributes
  • Add more selectors
  • Use HTTPlug internally

💖 Support my projects

I love Open Source. Whenever possible I share cool things with the world (check out NPM and GitHub).

If you like what I'm doing and this project helps you reduce development time, please consider:

  • ★ Starring and sharing the projects you like (and use)
  • ☕ Buying me a cup of coffee - PayPal.me/duzuns (contact at duzun.me)
  • ₿ Sending me some Bitcoin at this address: bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa

hquery.php's People

Contributors

dependabot[bot], duzun, fantom409, gibex, marcosraudkett, sekedus

hquery.php's Issues

Cache needed?

When using this library, is setting the cache folder absolutely necessary? What would happen if we didn't put it?

How do I get a value from a span?

Hi, great stuff and very fast!!

But how can I get a value from a span? I have this line in the HTML page:
<span class="price_value" itemprop="price">3,850 ₪</span>

How do I get the value (that is, 3,850 ₪)? I need the fastest way to get it.

I tried this:
$doc->find('span .price_value')->text()

and it works, but maybe there is a better or faster way to get the info.

Also, if I refresh 3-5 times I get this error:
Fatal error: Call to a member function text() on null in

How can I fix that?

Thanks :)
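
One way to avoid that fatal error is to check the result of find() before calling text() on it. The error message indicates that find() returned null, and the README's own example guards with `if ( $banners )`. The sketch below assumes only that a matched node exposes text(). The helper name `text_or_default()` is hypothetical, and `StubNode` stands in for a real hQuery node so the snippet runs without the library.

```php
<?php
// Hypothetical helper: trimmed text of a find() result, or a default when nothing matched.
function text_or_default($node, $default = null)
{
    // find() may give back null when the selector matches nothing
    if (!$node) {
        return $default;
    }
    return trim($node->text());
}

// Stand-in for a node returned by find(); only text() is assumed.
class StubNode
{
    private $text;

    public function __construct($text)
    {
        $this->text = $text;
    }

    public function text()
    {
        return $this->text;
    }
}

echo text_or_default(new StubNode(' 3,850 '), 'n/a'), PHP_EOL; // a matched node
echo text_or_default(null, 'n/a'), PHP_EOL;                    // nothing matched

// With hQuery loaded:
// $price = text_or_default($doc->find('span .price_value'), 'n/a');
```

If the null result appears only after several refreshes, it may be worth dumping hQuery::$last_http_result to see whether the server returned a different page.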

ErrorException: reset() expects parameter 1 to be array

public function hasClass($className) {
$ret = $this->doc()->hasClass($this, $className);
if ( count($this) < 2 ) return reset($ret);
return max($ret);
}

I get the following error:

ErrorException: reset() expects parameter 1 to be array, boolean given in /home/vagrant/sellercrew/vendor/duzun/hquery/src/hQuery/Element.php:163

My best guess here is that this assumes the class to be present in at least one place in the whole DOM. If not present, the $ret becomes false.
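
The guess above can be illustrated with a standalone sketch. It assumes, as the error message suggests, that the inner hasClass() call may return a bare boolean instead of an array. The function name `collapse_has_class()` is hypothetical, and this is not the library's actual fix.

```php
<?php
// Hypothetical standalone version of the return logic quoted above.
// $ret   - value from the inner hasClass() call (array of booleans, or a bare boolean)
// $count - number of elements in the collection
function collapse_has_class($ret, $count)
{
    // Guard: reset() and max() both need a non-empty array
    if (!is_array($ret) || !$ret) {
        return (bool) $ret;
    }
    if ($count < 2) {
        return (bool) reset($ret);
    }
    return (bool) max($ret);
}

var_dump(collapse_has_class(false, 1));              // boolean input no longer reaches reset()
var_dump(collapse_has_class(array(3 => true), 1));   // single element
var_dump(collapse_has_class(array(false, true), 2)); // at least one element has the class
```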

Initial Update

Hi 👊

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! 🤖

Make faster when looping through many pages

Hi, I have successfully used your package and it is really great. With only one URL it's fairly fast, but when I loop over multiple URLs, 50 URLs can take up to 10 minutes.

How do I make it work faster?

Redirection Follow: problem

Hello,

I have noticed some problems with redirection. The library fails on me and also on the demo: https://duzun.me/playground/hquery

Example URLs to check:
http://sorellhotels.com
http://bücher.ch
http://ipet.ch

To explain the problem and my debugging, I will use http://sorellhotels.com as an example.

This url has 3 redirections as follows:

hQuery does this:

It changes the host incorrectly: instead of "sorellhotels.com" it uses "tls".

Can you please check this? Thx

Best regards

Does it work? multiple class

I'm trying to get information from .typeHighlight class from link below:
trulia.com/property/1061429905-West-End-Heights-273-Barfield-Ave-SW-Atlanta-GA-30310

Instead of 11 nodes, I get 5. It's odd because if I use a simple html, I get the right results.

Undefined variable: te

When making the request $data = hQuery::fromUrl("https://www.google.ru/");

I get the following notices:

Notice:  Undefined variable: te in ...\duzun\hquery\hquery.php on line 2907

Notice:  Undefined variable: te in ...\duzun\hquery\hquery.php on line 2909

Notice:  Undefined variable: te in ...\duzun\hquery\hquery.php on line 2915

Support for HTTP proxies

Hi @duzun,

Great work on the library. Are you open to adding support for HTTP proxies to it?

I've put together a Gist with a test PHP script that demos what's involved (currently without proxy authentication).

Alternatively, is there a way to use hQuery::fromFile() with stream_context_create()?

I'll investigate, and update this ticket ;).

Thanks,

Nick

Scrape in background

I love what you've done with this! I was wondering if there was any way to have hQuery queue up as a background process. I built a tool with this API, but while it's scraping, no other pages on my local server will load until it is completely finished. Is there some sort of functionality for this?

javascript

Hi,
is it possible to evaluate JavaScript in the page before the page navigates?
Thanks

There's an error on test

Fatal error: Using $this when not in object context in C:\xampp\htdocs\scraper\duzun.me_playground_hquery.php on line 13

Get Attribute value

How can we get an attribute value from this element?

<div id="cerberus-data-metrics" style="display: none;" data-asin="B00EAHSBV4" data-asin-price="24.55" data-asin-shipping="0" data-asin-currency-code="USD" data-substitute-count="-1" data-device-type="WEB" data-display-code="Asin is not eligible because it has a retail offer" ></div>

The following does not work:
$p = $doc->find_text('#cerberus-data-metrics','data-asin-price');

Error on similar attributes

Hi!
I found an error when there are similar attributes on the same tag.

If an img tag has both src and src2 attributes and you call ->attr('src'), it finds nothing. If you call ->attr('src2'), it finds the attribute text as expected.

:nth-child support

Hi,

I tried to use :nth-child but it seems to not be working.

The rule I'm using is this one:
.product > .row > .col-12 > .row:nth-child(2)

Thanks

Running with JS enabled

Guessing I know the answer to this, but the site I was scraping data from recently required JS to be enabled to view the main contents.

Do you know of any way around this?

Add conditions to element

How do I filter which element is being processed?

For example, I have multiple classes:
$sels = "h2, .pnlDescription, .address";

I want to have a condition with the ".address" since I want to alter the text within it.

if(".address" == "somestring")
{
----code---
}

Where and how can I achieve this one?
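
One possible approach, sketched here under assumptions: loop over the result of find() and branch on hasClass(), which the README shows being called on individual elements. The helper name `collect_texts()`, the `'Address: '` prefix, and the `StubElement` class (a stand-in so the snippet runs without the library) are all hypothetical.

```php
<?php
// Hypothetical helper: collect text from matched nodes, altering only the .address ones.
function collect_texts($nodes)
{
    $out = array();
    if (!$nodes) {
        return $out; // nothing matched
    }
    foreach ($nodes as $pos => $node) {
        $text = trim($node->text());
        if ($node->hasClass('address')) {
            $text = 'Address: ' . $text; // example alteration, replace with your own
        }
        $out[$pos] = $text;
    }
    return $out;
}

// Stand-in for matched elements; only text() and hasClass() are assumed.
class StubElement
{
    private $classes;
    private $text;

    public function __construct(array $classes, $text)
    {
        $this->classes = $classes;
        $this->text    = $text;
    }

    public function hasClass($name)
    {
        return in_array($name, $this->classes, true);
    }

    public function text()
    {
        return $this->text;
    }
}

$nodes = array(
    new StubElement(array(), 'Listing title'),
    new StubElement(array('address'), ' 12 Main St '),
);
print_r(collect_texts($nodes));

// With hQuery loaded:
// $texts = collect_texts($doc->find('h2, .pnlDescription, .address'));
```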

Replace

Hi!
How can I:

  1. Replace tag? For ex: <title>News</title> replace to <title>My Site</title>
  2. Find all HTML code from to and delete them?

get_cache

How to utilize get_cache method? i.e., how to determine the particular cache file name/$fn argument?

"Cannot redeclare class duzun\hQuery" right after install with composer

I'm installing hQuery in laravel with composer:

dev@laravel:~/apps/testapp$ composer require duzun/hquery
Using version ^1.5 for duzun/hquery
./composer.json has been updated
Warning: You should avoid overwriting already defined auth settings for github.com.
Loading composer repositories with package information
Updating dependencies (including require-dev)
  - Installing duzun/hquery (1.5.0)
    Downloading: 100%

Writing lock file
Generating autoload files
> Illuminate\Foundation\ComposerScripts::postUpdate
> php artisan optimize
Generating optimized class loader
dev@laravel:~/apps/testapp$ php artisan runcommand

  [ErrorException]
  Cannot redeclare class duzun\hQuery

runcommand contents:

use duzun\hQuery;
hQuery::$cache_path = '/home/dev/apps/testapp/storage/cache';

I've commented line //class_alias('hQuery', 'duzun\\hQuery'); in psr-4\hQuery.php and that solved the issue, but I'm not sure is that ok or not :)

Find element by data attribute

Is it possible to find elements by their data attribute?

I have tried:

$dom->find('span[data-price]')

But this doesn't find any spans with that data attribute which are there!

Exception Error warning

I use symfony 3.2 and got this warning.

php.DEBUG: Notice: Undefined index: CONTENT_ENCODING {"exception":"[object] (Symfony\Component\Debug\Exception\SilencedErrorContext: {"severity":8,"file":"/../USER/MYPROJECT/vendor/duzun/hquery/hquery.php","line":2967})"}

Notice: Undefined index: method also appears in line 2751.
Can you please check this?

Error in file permissions

[2] An error occurred in file /var/www/html/app/vendor/duzun/hquery/src/hQuery.php on line 629: filemtime(): stat failed

I am using a custom framework with the Blade templating engine. The cache files for Blade are stored without any issues, but I get this error in your plugin. I have set the default permissions for the cache folder to rwx for all users, but I still get the error.

can't get a website (help, question)

Hey, I have this code, it works perfectly on localhost, but it doesn't on my server

Can you help me?

Note: I know this is not a cute code, it is just for the example

$doc = hQuery::fromUrl(
    'http://www.submanga.com/Naruto'
  , array(
        'Accept'     => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    )
);

Selector .class1.class2 seems to not work

If I want to get all elements that have both class1 and class2, it seems class2 is ignored and all results that have class1 are returned.

I'm using like this:
$item = $doc->find( '.header-product-info--price.margin-bottom-10' );

I get all the items that have class header-product-info--price, even the ones that don't have class margin-bottom-10.
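
Until the combined selector behaves as expected, a possible workaround is to select by one class and then filter the result with hasClass(), which the README shows on individual elements. This is only a sketch: `filter_by_classes()` is a hypothetical helper, and `TaggedNode` is a stand-in so the snippet runs without the library.

```php
<?php
// Hypothetical workaround: keep only the nodes that carry every class in $classes.
function filter_by_classes($nodes, array $classes)
{
    $kept = array();
    if (!$nodes) {
        return $kept; // nothing matched
    }
    foreach ($nodes as $node) {
        $hasAll = true;
        foreach ($classes as $class) {
            if (!$node->hasClass($class)) {
                $hasAll = false;
                break;
            }
        }
        if ($hasAll) {
            $kept[] = $node;
        }
    }
    return $kept;
}

// Stand-in for matched elements; only hasClass() is assumed.
class TaggedNode
{
    public $classes;

    public function __construct(array $classes)
    {
        $this->classes = $classes;
    }

    public function hasClass($name)
    {
        return in_array($name, $this->classes, true);
    }
}

$first  = new TaggedNode(array('header-product-info--price'));
$second = new TaggedNode(array('header-product-info--price', 'margin-bottom-10'));

$both = filter_by_classes(array($first, $second), array('header-product-info--price', 'margin-bottom-10'));
echo count($both), PHP_EOL;

// With hQuery loaded:
// $items = filter_by_classes($doc->find('.header-product-info--price'), array('margin-bottom-10'));
```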

how hasClass is implemented?

I see mentions in the code, could you please give an example of how should I use it.
Thanks! Great lib and I still use it :)

How do you get node text without its children?

Hi, first of all, thanks for the nice lib.

for example, we have code like this:

<li id="listItem">
    This is some text
    <span id="firstSpan">First span text</span>
    <span id="secondSpan">Second span text</span>
</li>

We need to get This is some text only. The text() method will give us This is some textFirst span textSecond span text

There's solution for jQuery: http://stackoverflow.com/questions/3442394/jquery-using-text-to-retrieve-only-text-not-nested-in-child-tags

Is it possible with hQuery?
Thanks.

Use proxy with fromURL

Hello,
how can I use a proxy with fromURL?

I use that code:

$doc = hQuery::fromUrl(
                $url
                , [
                    'Accept'     => $accept_html,
                    'User-Agent' => $user_agent,
                    'Referer'    => $referer,
                ]
 );

thank you
Jochen

DS not defined

Line 1890 uses DS which is not defined?

Where should this be done?

how remove a node/element ?

Many HTML pages contain scripts or other unwanted elements that need to be deleted before fetching text, so a way to search for and delete elements is needed. If there is a way to delete a node, please describe how; if not, please plan to implement it or guide me in adding this option.

Thanks

503 error code

I try to run hQuery but I receive 503 error code.
What is this?

How to retrieve next element?

foreach ($prod->find('.product-options dt') as $v) {
    echo str_replace('*', '', $v->text()) . ' ';
    echo $v->next();
}
The next element is a dd, but echo next returns nothing. How do I get the next element object? I need to get the values for dt and dd within every dl, but dd is a select element with multiple values I need.
