luyadev / luya-module-crawler

Crawl a website and provide intelligent search results

Home Page: https://luya.io

License: MIT License

PHP 72.71% HTML 27.29%
crawler luya search intelligent-search yii2 hacktoberfest

luya-module-crawler's Introduction

Crawler

An easy-to-use full-website crawler that provides search results on your page. The crawler module gathers all information about the pages on the configured domain and stores the index in the database. From there you can run search queries to provide search results. There are also helper methods which provide intelligent search results by splitting the input into multiple search queries (used by default).

Installation

Install the module via composer:

composer require luyadev/luya-module-crawler:^3.0

After installation via Composer, add the module to the modules section of your configuration file.

'modules' => [
    //...
    'crawler' => [
        'class' => 'luya\crawler\frontend\Module',
        'baseUrl' => 'https://luya.io',
        /*
        'filterRegex' => [
            '#\.html#i', // filter all links with `.html`
            '#/agenda#i', // filter all links which contain the word agenda with a leading slash
            '#date\=#i', // filter all links containing `date=`, for example when using an agenda which generates infinite links
        ],
        'on beforeProcess' => function() {
            // optional add or filter data from the BuilderIndex, which will be processed to the Index afterwards
        },
        'on afterIndex' => function() {
            // optional add or filter data from the freshly built Index
        }
        */
    ],
    'crawleradmin' => 'luya\crawler\admin\Module',
]

Here, baseUrl is the domain whose pages you want to crawl.
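
To get a feeling for which links the filterRegex patterns from the configuration above would catch, the following minimal sketch tests URLs against them. The helper isFilteredUrl() is hypothetical and not part of the module, and it assumes that a link is skipped as soon as it matches any one pattern.

```php
<?php
// Hypothetical helper, not part of the module: returns true when the URL
// matches at least one of the given filter patterns.
function isFilteredUrl(string $url, array $filterRegex): bool
{
    foreach ($filterRegex as $pattern) {
        if (preg_match($pattern, $url) === 1) {
            return true;
        }
    }

    return false;
}

// The same patterns as in the commented configuration example.
$filterRegex = [
    '#\.html#i',
    '#/agenda#i',
    '#date\=#i',
];

var_dump(isFilteredUrl('https://luya.io/agenda?date=2020-01-01', $filterRegex)); // bool(true)
var_dump(isFilteredUrl('https://luya.io/guide', $filterRegex));                  // bool(false)
```

Running your own URLs through such a check before a crawl is a cheap way to verify that a pattern is neither too broad nor too narrow.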

After setting up the module in your config, run the migrations and the import command (to set up permissions):

./vendor/bin/luya migrate
./vendor/bin/luya import

Running the Crawler

To run the crawler process, use the crawler command crawl:

./vendor/bin/luya crawler/crawl

Make sure your page is in utf8 mode (<meta charset="utf-8"/>) and make sure to set the language <html lang="<?= Yii::$app->composition->langShortCode; ?>">.

In order to provide current crawl results you should create a cronjob which crawls the page each night: cd httpdocs/current && ./vendor/bin/luya crawler/crawl

Crawler Arguments

The crawler/crawl command accepts the following arguments, for example crawler/crawl --pdfs=0 --concurrent=5 --linkcheck=0:

name | description | default
--- | --- | ---
linkcheck | Whether all links should be checked after the crawler has indexed your site | true
pdfs | Whether PDFs should be indexed by the crawler or not | true
concurrent | The number of concurrent page crawls | 15

Stats

You can also get statistics results by setting up a cronjob which runs each week:

./vendor/bin/luya crawler/statistic

Create search form

Send a GET request with the query parameter to the crawler/default/index route and render the view as follows:

<?php
use luya\helpers\Url;
use yii\widgets\LinkPager;
use luya\crawler\widgets\DidYouMeanWidget;
/* @var $query string The lookup query encoded */
/* @var $language string */
/* @var $this \luya\web\View */
/* @var $provider \yii\data\ActiveDataProvider */
/* @var $searchModel \luya\crawler\models\Searchdata */
?>

<form class="searchpage__searched-form" action="<?= Url::toRoute(['/crawler/default/index']); ?>" method="get">
    <input id="search" name="query" type="search" value="<?= $query ?>">
    <input type="submit" value="Search"/>
</form>

<h2><?= $provider->totalCount; ?> Results</h2>

<?php if ($query && $provider->totalCount == 0): ?>
    <div>No results found for &laquo;<?= $query; ?>&raquo;.</div>
<?php endif; ?>

<?= DidYouMeanWidget::widget(['searchModel' => $searchModel]); ?>
<?php foreach($provider->models as $item): /* @var $item \luya\crawler\models\Index */ ?>
    <h3><?= $item->title; ?></h3>
    <p><?= $item->preview($query); ?></p>
    <a href="<?= $item->url; ?>"><?= $item->url; ?></a>
<?php endforeach; ?>
<?= LinkPager::widget(['pagination' => $provider->pagination]); ?>

Crawler Settings

You can use crawler tags to trigger certain events or store information:

tag | example | description
--- | --- | ---
CRAWL_IGNORE | <!-- [CRAWL_IGNORE] -->Ignore this<!-- [/CRAWL_IGNORE] --> | Excludes the enclosed content from indexing.
CRAWL_FULL_IGNORE | <!-- [CRAWL_FULL_IGNORE] --> | Ignores the full page. Keep in mind that links found inside the ignored page are still added to the index.
CRAWL_GROUP | <!-- [CRAWL_GROUP]api[/CRAWL_GROUP] --> | Tells the crawler which group/section the current page belongs to, so that you can group your results by the group field.
CRAWL_TITLE | <!-- [CRAWL_TITLE]My Title[/CRAWL_TITLE] --> | Ensures that your customized title is always used for the page.
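
The tags are plain HTML comments, so their effect can be illustrated with two regular expressions. The sketch below is not the module's implementation; stripCrawlIgnore() and extractCrawlTag() are hypothetical helpers that only show how a CRAWL_IGNORE section could be dropped and how a CRAWL_GROUP or CRAWL_TITLE value could be read.

```php
<?php
// Hypothetical helper: removes everything between the CRAWL_IGNORE markers.
function stripCrawlIgnore(string $html): string
{
    $pattern = '#<!--\s*\[CRAWL_IGNORE\]\s*-->.*?<!--\s*\[/CRAWL_IGNORE\]\s*-->#s';

    // preg_replace() returns null on error; fall back to the untouched input.
    return preg_replace($pattern, '', $html) ?? $html;
}

// Hypothetical helper: reads the value of a tag such as CRAWL_GROUP or CRAWL_TITLE.
function extractCrawlTag(string $html, string $tag): ?string
{
    $name = preg_quote($tag, '#');
    $pattern = '#<!--\s*\[' . $name . '\](.*?)\[/' . $name . '\]\s*-->#s';

    return preg_match($pattern, $html, $matches) === 1 ? trim($matches[1]) : null;
}

$html = '<p>Keep</p>'
    . '<!-- [CRAWL_IGNORE] --><nav>Menu</nav><!-- [/CRAWL_IGNORE] -->'
    . '<!-- [CRAWL_GROUP]api[/CRAWL_GROUP] -->';

echo stripCrawlIgnore($html), "\n";               // <p>Keep</p><!-- [CRAWL_GROUP]api[/CRAWL_GROUP] -->
echo extractCrawlTag($html, 'CRAWL_GROUP'), "\n"; // api
var_dump(extractCrawlTag($html, 'CRAWL_TITLE'));  // NULL
```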

luya-module-crawler's People

Contributors

agaplus, antikon, arhell, asyou99, dennisgon, dependabot[bot], dev7ch, hbugdoll, jdl747, martinpetrasch, nadar, nandes2062, ph0tonic, rodzadra, rolandschaub, samdark, testt23, trk, vuongxuongminh

luya-module-crawler's Issues

Add database indexes

Add index keys for the builder and index tables in order to speed up the crawler process.

Abort if baseUrl wrong

Abort the crawl and display a meaningful error message if the baseUrl does not exist or is wrong.

crawler before run page index

This issue has originally been reported by @nadar at luyadev/luya#1443.
Moved here by @nadar.


'crawler' => [
    'class' => 'luya\crawler\frontend\Module',
    'indexer' => [
        'app\indexer\MyNewModuleIndex',
    ],
]

class MyNewModuleIndex extends CrawlerIndexer
{
    // Returns additional links which the crawler should index.
    public function getLinks()
    {
        $data = [];
        foreach (News::find()->all() as $item) {
            $data[] = $item->getDetailUrl();
        }

        return $data;
    }
}

Search with results should be checked if result persists

The auto complete method uses the search index with successful queries, therefore this index should be checked for consistency. In order to improve the index we should:

  • Not create another index entry if the query already exists, just update the timestamp (last_time_appearance or similar). For search results tracking, we still need to add them.
  • Loop over the index and check whether results are still found or not, then update the count. This could be done at the end of the crawler command.

cut UTF-8 string problem

What steps will reproduce the problem?

Cyrillic string in query

What is the expected result?

image

What do you get instead?

image
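
The screenshots are not preserved here, so the exact cause is unknown. One common way to get a broken Cyrillic preview is to cut a UTF-8 string by bytes instead of by characters. The following sketch only demonstrates that effect under this assumption (it requires the mbstring extension) and is not taken from the module's code.

```php
<?php
// Each Cyrillic letter takes two bytes in UTF-8.
$text = 'Привет мир';

// substr() counts bytes: 5 bytes end in the middle of the third letter.
$byteCut = substr($text, 0, 5);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)

// mb_substr() counts characters and keeps the string valid.
$charCut = mb_substr($text, 0, 5, 'UTF-8');
echo $charCut, "\n"; // Приве
```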

cannot crawl to submenu level >2

Hi. The crawler is able to find and save titles/pages at the top level (level 1) and one level below (level 2), but it does not work on pages/content/titles at level 3 and deeper. Check my screenshots.

Also, the page on level 3 is not displayed.

What is the expected result?

crawler should get all page/content/title in any level

What do you get instead? (A Screenshot can help us a lot!)

The admin Page
1

The search page. after I run ./vendor/bin/luya crawler/crawl
2

page on level 3 is not displayed sorry for posting in here
3

LUYA Check output (run this script and post the result: luyacheck.php)

1: [in_array('mod_rewrite', apache_get_modules())] true
2: [ini_get('short_open_tag')] ''
3: [ini_get('error_reporting')] '22527'
4: [phpversion()] '7.4.3'
5: [php_ini_loaded_file()] '/etc/php/7.4/apache2/php.ini'
6: [php_sapi_name()] 'apache2handler'
7: [isset($_SERVER['SERVER_SOFTWARE']) ? $_SERVER['SERVER_SOFTWARE'] : unknown] 'Apache/2.4.41 (Ubuntu)'

Additional infos

Q | A
--- | ---
LUYA Version | ^4.0
PHP Version | 7.4.3
Platform | Apache2, mysql.
Operating system | linux ubuntu.

empty search php warning

PHP Warning – yii\base\ErrorException
count(): Parameter must be an array or an object that implements Countable
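
Since PHP 7.2, count() emits this warning when it receives a value such as null, and PHP 8 throws a TypeError instead. Where exactly the module passes such a value is not stated in this report, so the following is only a generic guard. safeCount() is a hypothetical helper and relies on is_countable(), which is available from PHP 7.3.

```php
<?php
// Hypothetical helper: count() that tolerates null and scalar values.
function safeCount($value): int
{
    return is_countable($value) ? count($value) : 0;
}

echo safeCount(null), "\n";       // 0
echo safeCount([]), "\n";         // 0
echo safeCount(['a', 'b']), "\n"; // 2
```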

preg match delimiter

What steps will reproduce the problem?

Search with a value which contains the regex delimiter

What is the expected result?

should work!

What do you get instead? (A Screenshot can help us a lot!)

preg_match_all(): Unknown modifier 'd'
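
This error typically appears when user input is placed into a pattern without escaping, so that a delimiter inside the input ends the pattern early and the following characters are read as modifiers. A minimal sketch of the usual remedy, assuming `/` is the delimiter (the report does not say which one the module uses), is to pass the input through preg_quote() together with the delimiter. countMatches() is a hypothetical helper.

```php
<?php
// Hypothetical helper: counts case-insensitive occurrences of $word in $content.
function countMatches(string $content, string $word): int
{
    // Without preg_quote() the pattern would be "/up/down/i", and PHP would
    // report "Unknown modifier 'd'" for the text after the second slash.
    $pattern = '/' . preg_quote($word, '/') . '/i';

    return (int) preg_match_all($pattern, $content);
}

echo countMatches('Scroll up/down to see more. Up/Down keys work too.', 'up/down'), "\n"; // 2
```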

<!-- [CRAWL_IGNORE]--> behaviour

Does the <!-- [CRAWL_IGNORE]--> tag also prevent the crawler from following links, or just from indexing the given content? It's not clear from the readme.

I think it should still follow the links and not index the content, that's the most obvious behaviour for a crawler, isn't it? To prevent the crawler from following links, something like <!-- [CRAWL_NOFOLLOW] --> could be used, but I can't think of any use case for that.

did you mean widget with empty params

If a search model is provided which is empty or false, we should just not render anything.

Otherwise this can lead to a bug when doing a search request with an empty query.

Crawler Refactor

Full refactor of the crawler.

What must be included/features:

  • filter urls by a regex (do not follow them, blacklist, or for circular references or infinite links like calendars)
  • config to blacklist certain extensions which might be on the same host, but can save memory when not following them
  • crawler type by format: html = crawl links, pdf = crawl pdf
  • check external links whether they return a 200 or not (linkchecker)
  • indexer interface can provide additional links from outside

type settings:

html:

  • whether to use title, h1 or CRAWL_TITLE
  • CRAWL_GROUP
  • CRAWL_FULL_IGNORE
  • CRAWL_IGNORE (the text inside the section will be ignored)

luya\crawler\models\Index preview error

return StringHelper::highlightWord($content, explode(" ", $word), $highlight);

This line will cause a "preg_match_all(): Compilation failed: nothing to repeat at offset 0" error if the $word variable has a trailing space, because explode() then produces an empty string.

Example:
$x = new \luya\crawler\models\Index();
$x->content = 'Test test test';
echo $x->preview('Test ');
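
The empty element can be reproduced without the module. The sketch below shows what explode() returns for the input above and one possible way to avoid it: splitting on whitespace with PREG_SPLIT_NO_EMPTY. splitWords() is a hypothetical helper, not the fix that was applied in the module.

```php
<?php
// Hypothetical helper: splits a query into words and drops empty parts.
function splitWords(string $word): array
{
    return preg_split('/\s+/', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

var_dump(explode(' ', 'Test ')); // two elements: "Test" and ""
var_dump(splitWords('Test '));   // one element: "Test"
```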
