tomasnorre / crawler

Libraries and scripts for crawling the TYPO3 page tree. Used for re-caching, re-indexing, publishing applications etc.

License: GNU General Public License v3.0

PHP 96.40% HTML 3.54% Shell 0.04% CSS 0.01%
typo3 extension crawler cache-warmup php indexer hacktoberfest

crawler's Introduction

TYPO3 Crawler

Libraries and scripts for crawling the TYPO3 page tree. Used for re-caching, re-indexing, publishing applications etc.

You can include the crawler in your TYPO3 project with Composer or from the TER:

composer require tomasnorre/crawler

Crawler processes

[Screenshot: backend process list]

Versions and Support

Release | TYPO3     | PHP     | Fixes will contain
12.x.y  | 12.4      | 8.1-8.3 | Features, Bugfixes, Security Updates
11.x.y  | 10.4-11.5 | 7.4-8.1 | Bugfixes, Security Updates, Since 11.0.3 PHP 8.1
10.x.y  | 9.5-11.0  | 7.2-7.4 | Security Updates
9.x.y   | 9.5-11.0  | 7.2-7.4 | As this version has same requirements as 10.x.y, there will be no further releases of this version, please update instead.
8.x.y   | Releases do not exist
7.x.y   | Releases do not exist
6.x.y   | 7.6-8.7   | 5.6-7.3 | Security Updates

Documentation

Please read the documentation

To render the documentation locally, please use the official TYPO3 Documentation rendering Docker Tool. https://github.com/t3docs/docker-render-documentation

Contributions

Please see CONTRIBUTING.md

Honorable Previous Maintainers

  • Kasper Skaarhoj
  • Daniel Poetzinger
  • Fabrizio Branca
  • Tolleiv Nietsch
  • Timo Schmidt
  • Michael Klapper
  • Stefan Rotsch

crawler's People

Contributors

ayacoo, bmack, brotkrueml, chetan-thapliyal, cweiske, fl3s, hannatrunk, hawkeye1909, infabo, kpnielsen, kraemer-igroup, localheinz, lochmueller, macjohnny, maximilian-walter, metapublic-gbr, patta, peterkraume, philippkuhlmay, rengaw83, rvock, srotsch, stuartmcfarlane, stucki, tbal, tmotyl, tomasnorre, tomasvotruba, tstahn, werkraum-admin

crawler's Issues

Regression - Crawler cannot create new processes

It seems that the commit 656cd90 introduced a regression. Occurs in TYPO3 6 & 7 LTS.

Steps to reproduce:

  1. Add pages to the crawler queue
  2. In the BE, go to the Process Overview and try to start a new process
  3. The process will not start and an error will appear.

[Screenshot, 2016-01-22 09:21:02]

Or

  1. Add pages to the crawler queue
  2. In the Scheduler, create and start the Crawler Run task.
  3. Two PHP warnings will occur but no entries from the queue will be executed.

[Screenshot, 2016-01-22 09:25:09]

TYPO3 4.5 scheduler namespace fatal error

The extension is defined as compatible with TYPO3 4.5, but if I try to open the scheduler I get

PHP Fatal error: Class 'TYPO3\CMS\Scheduler\Task\AbstractTask' not found in /.../typo3conf/ext/crawler/scheduler/class.tx_crawler_scheduler_im.php on line 32

because the extension uses the namespaced classes of TYPO3 6.2+.

Environment: TYPO3 4.5.41, PHP 5.5.9

[BE Module] SelectBox - Menu call to 'index.php?id='

I'm trying to reindex my website after the upgrade from 4.5 to 6.2. My current TYPO3 version is 6.2.6, with crawler 3.6.2.

My problem: when I select an option other than the currently selected one in the Site Crawler menu, I am redirected to the backend.php section of TYPO3. I don't know why other people don't have the same problem.

I had the same problem with other extensions and solved it. I found the solution in this article:

http://docs.typo3.org/typo3cms/InsideTypo3Reference/CoreArchitecture/BackendModules/BackendModulesUsingTypo3Modphp/Index.html

although I had to use the function \TYPO3\CMS\Backend\Utility\BackendUtility::getModuleUrl().

If I'm not mistaken, the URL call in this context is made in the main function of the class "tx_crawler_contextMenu", and the HTML code is loaded in the module (modfunc1/class.tx_crawler_modfunc1.php) in the function slectorBox. When I inspect the code of the select box in the browser, I get:

Start Crawling Crawler log Crawling Processes

Shouldn't jumpToUrl be called with mod.php?M=crawler&moduleKey=xxxxx&id=xxx... ?

Does anybody know what's going on?

Thanks

Migrated from: https://forge.typo3.org/issues/63772

Warning generated in IconUtility

I've set up crawler for indexing through the CLI and am now getting the following warning whenever it runs:

PHP Warning: in_array() expects parameter 2 to be array, null given in /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Utility/IconUtility.php on line 594
PHP Stack trace:
PHP 1. {main}() /var/www/typo3/typo3_src-6.1.1/typo3/cli_dispatch.phpsh:0
PHP 2. include() /var/www/typo3/typo3_src-6.1.1/typo3/cli_dispatch.phpsh:65
PHP 3. tx_crawler_lib->CLI_main_im() /var/www/website/typo3/typo3conf/ext/crawler/cli/crawler_im.php:8
PHP 4. tx_crawler_lib->getPageTreeAndUrls() /var/www/website/typo3/typo3conf/ext/crawler/class.tx_crawler_lib.php:1941
PHP 5. TYPO3\CMS\Backend\Tree\View\AbstractTreeView->getTree() /var/www/website/typo3/typo3conf/ext/crawler/class.tx_crawler_lib.php:1583
PHP 6. TYPO3\CMS\Backend\Tree\View\AbstractTreeView->getTree() /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Tree/View/AbstractTreeView.php:794
PHP 7. TYPO3\CMS\Backend\Tree\View\AbstractTreeView->getIcon() /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Tree/View/AbstractTreeView.php:808
PHP 8. TYPO3\CMS\Backend\Utility\IconUtility::getSpriteIconForRecord() /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Tree/View/AbstractTreeView.php:681
PHP 9. TYPO3\CMS\Backend\Utility\IconUtility::getSpriteIcon() /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Utility/IconUtility.php:693
PHP 10. in_array() /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Utility/IconUtility.php:594
It doesn't seem to impact the indexing setup, but it leaves a nasty message in the logs :P

IMPORTANT: Please see comment/discussion here: https://forge.typo3.org/issues/50558

Crawling Processes Page information getting into failure url when you click on any of the buttons there (Typo3 7.6.0)

When you go to the Backend Info page, select the Site Crawler information page and then choose "Crawling Processes" from the Page information select box, you get 3 buttons:
"Refresh", "Enable crawling" and "Show finished and terminated processes".
If you click on any of them, the right-hand TYPO3 backend iframe loads the index of the TYPO3 backend.
See the screenshots:
[Screenshot 1: the exact page where the buttons are]
[Screenshot 2: the result when you click on any of them]

Deleting old tx_crawler_processes entries ends up in slow query

This UPDATE query is very slow. Is there any way to improve it?
Using UPDATE to set the process to deleted=1 also leaves the table with a lot of old deleted entries.

    $GLOBALS['TYPO3_DB']->exec_UPDATEquery(
        'tx_crawler_process',
        'active=0
        AND NOT EXISTS (
            SELECT * FROM tx_crawler_queue
            WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
            AND tx_crawler_queue.exec_time = 0
        )',
        array(
            'deleted'=>'1'
        )
    );

We get this in the MySQL process list all the time:

Command: Query | Time: 1033 | State: Sending data | Info: UPDATE tx_crawler_process SET deleted = '1' WHERE active =0 AND NOT EXISTS (SEL

Is there a possibility to use DELETE instead of UPDATE, so the table doesn't fill up that much?

IMPORTANT: Please see the discussion/questions at: https://forge.typo3.org/issues/6792. Further discussion will continue here.
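
A minimal sketch of the DELETE variant asked about above, assuming the condition from the quoted query is kept as it is. The helper name `buildProcessCleanupSql()` is hypothetical and only builds the SQL string. Whether DELETE is actually faster here is not confirmed by the issue and would need to be measured, for example with EXPLAIN.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the crawler extension): builds the
 * clean-up statement for finished processes, either as the soft delete
 * quoted above or as a hard DELETE.
 */
function buildProcessCleanupSql(bool $hardDelete): string
{
    // Equivalent to the condition in the quoted exec_UPDATEquery() call.
    $where = 'active = 0 AND NOT EXISTS ('
        . 'SELECT 1 FROM tx_crawler_queue'
        . ' WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id'
        . ' AND tx_crawler_queue.exec_time = 0)';

    return $hardDelete
        ? 'DELETE FROM tx_crawler_process WHERE ' . $where
        : "UPDATE tx_crawler_process SET deleted = '1' WHERE " . $where;
}

echo buildProcessCleanupSql(true), PHP_EOL;
```

With the legacy database API used in the quoted snippet, the same condition could presumably be passed to `$GLOBALS['TYPO3_DB']->exec_DELETEquery('tx_crawler_process', $where)`. Any code that still reads rows with deleted=1 would have to be checked first.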

Incompatible with Indexed_search

From the last release of 4.0.0, we got this error reported:

https://forge.typo3.org/issues/55106#note-17

Hi all,
I installed and tried the latest dev version of the crawler today (https://git.typo3.org/TYPO3CMS/Extensions/crawler.git/snapshot/20bfb07561b5e15f2324fd84751c633cb049dacd.zip) on TYPO3 6.2.7, PHP 5.5.9.
Unfortunately it doesn't seem to queue anything.

This is my process (mostly following the howto: http://xavier.perseguers.ch/tutoriels/typo3/articles/indexed-search-crawler.html):

  1. Add a crawler record named simple-indexing at the page tree root.
  2. Truncate all affected MySQL tables to start clean.
  3. Go to a normal page with Web > Info and look at Indexed Search (OK, no entries).
  4. Stay in Web > Info, but now choose Site crawler.
  5. Choose simple-indexing, click Update and then Crawl URLs.

Here, Update refreshes the page but Count stays at zero. Crawl URLs leads to another page showing "0 URL submitted".

I went through the same process this morning with crawler 3.5.0 from the TER: Count goes to 1 and I get "1 URL submitted", but starting the job manually gave no indexing results. That is how I came across the current bug report, and why I would like to find a crawler version that runs with my TYPO3 version.

Thanks for your help.

Migrated from: https://forge.typo3.org/issues/68949

HTTP Authentication support

It would be very nice to have HTTP authentication support in the crawler, so that websites (mainly intranets) protected by htaccess authentication could be crawled too, especially if it's not possible to exclude the server itself from the authentication procedure.

Migrated from: https://forge.typo3.org/issues/47595
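
For reference, HTTP Basic authentication (the scheme typically behind an htaccess-protected site) only needs one extra request header. A minimal sketch: the function name and credentials are hypothetical, and how the crawler would store and send them is not specified in this request.

```php
<?php
declare(strict_types=1);

/**
 * Builds the request header for HTTP Basic authentication:
 * "Basic " followed by base64("user:password").
 */
function basicAuthHeader(string $user, string $password): string
{
    return 'Authorization: Basic ' . base64_encode($user . ':' . $password);
}

// Example credentials are placeholders.
echo basicAuthHeader('user', 'pass'), PHP_EOL; // Authorization: Basic dXNlcjpwYXNz
```

Base64 is an encoding, not encryption, so such a header should only travel over HTTPS or a trusted network.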

"crawl" context menu doesn't work - fatal error on TYPO3 6.2.19

Steps:

  1. Go to a page that has a crawler configuration record
  2. Open the context (right-click) menu on the crawler record
  3. Choose "crawler" from the menu

Result:
Fatal error: Class 'TYPO3\CMS\Core\Utility\GeneralUtility' not found in typo3/sysext/info/mod1/index.php on line 21

TCA field 'base_url' displayCond is broken

The display condition in the 'base_url' column in TCA is broken. While it seems this isn't a big deal as you could potentially use the base url from the domain record - it is when you're trying to crawl via https:// as 'base_url' is the only way to achieve this...

Pull request follows shortly..

Hidden records should be ignored

I have a page where I display data records that are organized in a sysfolder. I'm using the following configuration to crawl the different contents (page uid is 13, sysfolder uid is 15):

    tx_crawler.crawlerCfg.paramSets {
        items = &tx_myext[items]=[_TABLE:tt_myext_items;_PID:15]
        items {
            pidsOnly = 13
            cHash = 1
            procInstrFilter = tx_indexedsearch_reindex
        }
    }

The problem: Hidden records in the sysfolder are still added to the queue and later indexed as empty pages, because the hidden content cannot be shown. This also produces entries in the search results, which lead users to an empty page. That is pretty bad.

Expected behavior: Hidden records should be ignored when building the initial crawler queue.

It seems that a similar problem is described in #7455, but that bug is closed.

IMPORTANT: See comments/discussion here: https://forge.typo3.org/issues/43655
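
The expected behavior can be sketched as a filter applied before records are expanded into queue URLs. This is an illustration only: the function is hypothetical, and the field names `hidden` and `deleted` are assumptions (in TYPO3 the actual column names come from the table's TCA configuration).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical filter: returns the uids of records that are neither
 * hidden nor deleted, so only visible records end up in the queue.
 *
 * @param array<int, array<string, mixed>> $records
 * @return int[]
 */
function visibleRecordUids(array $records): array
{
    $uids = [];
    foreach ($records as $record) {
        if (empty($record['hidden']) && empty($record['deleted'])) {
            $uids[] = (int)$record['uid'];
        }
    }
    return $uids;
}

// Made-up sample rows; record 2 is hidden.
$records = [
    ['uid' => 1, 'hidden' => 0, 'deleted' => 0],
    ['uid' => 2, 'hidden' => 1, 'deleted' => 0],
    ['uid' => 3, 'hidden' => 0, 'deleted' => 0],
];
print_r(visibleRecordUids($records)); // contains uids 1 and 3
```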

Fix CLI: incorrect start index / high memory consumption.

Two related issues were posted to the forge bug tracker:

  • Currently, in CLI mode the startpage argument is read from the wrong index. As a result, the crawler always starts crawling from id 0.
  • Currently, the crawler uses a lot of memory and runs into timeouts even in the CLI context.

A patch addressing these issues was committed there as well. It uses the passed page id instead of 0 as the start page. This patch should be merged into the GitHub repo too.

It's not possible to mass-edit "Base url" field in the crawler configuration

Steps to reproduce:

  1. Go to the list view
  2. Click on the "Crawler Configuration" table header
  3. Add "Base url" column to the table

Actual: base url column is empty
Expected: base url column is filled with data

  4. Click on the pencil icon to mass edit

Actual: base url selectors will not appear
Expected: base url selectors will be visible

TYPO3 6.2.19

AJAX Call in CE Content Menu throws PHP Fatal Error

TYPO3 7.6
crawler 5.0

With the crawler extension installed the CE context menu stops working and throws a PHP Fatal Error each time it is clicked. From the log:

Core: Exception handler (WEB): Uncaught TYPO3 Exception: #1: PHP Catchable Fatal Error: Argument 1 passed to tx_crawler_contextMenu::main() must be an instance of clickMenu, instance of TYPO3\CMS\Backend\ClickMenu\ClickMenu given, called in /var/www/staging.example.org/typo3_src-7.6.0/typo3/sysext/backend/Classes/ClickMenu/ClickMenu.php on line 488 and defined in /var/www/staging.example.org/public/typo3conf/ext/crawler/class.tx_crawler_contextMenu.php line 43 | TYPO3\CMS\Core\Error\Exception thrown in file /var/www/staging.example.org/typo3_src-7.6.0/typo3/sysext/core/Classes/Error/ErrorHandler.php in line 111. Requested URL: http://staging.example.org/typo3/index.php?ajaxID=%2Fajax%2Fcontext-menu&ajaxToken=a3774c8d3796a751e31882c20de9e5c4c582a53b&table=tt_content&uid=227&listFr=1

Hope that helps!

MultiCrawler Output

Yesterday I tested and accepted #69331, but I'm not sure it's the right approach.

The setting processDebug is one thing and processVerbose is another, so I'm considering having $this->verbose read from processVerbose, a new setting that would have to be introduced.

Feedback appreciated.

IMPORTANT: Please see the discussion/questions here: https://forge.typo3.org/issues/69341. Further discussion will continue here.

Something went wrong: process did not appear within 10 seconds.

Hi,

Just upgraded the crawler from 3.2.0 (TYPO3 4.5.22) to the latest 3.5.0 and it seems it does not work anymore.
When trying to add a process the error:

"Something went wrong: process did not appear within 10 seconds."

appears.

I have tried this on my local Windows dev machine and on the Linux server itself, and both seem to fail.

PHP version 5.3.15 on Windows and 5.3.3 on Linux.

IMPORTANT: See the discussion/questions at: https://forge.typo3.org/issues/44875

tx_crawler_lib::flushQueue() may unnecessarily eat tons of memory

Here is the relevant code from that function:

    if(tx_crawler_domain_events_dispatcher::getInstance()->hasObserver('queueEntryFlush')) {
        $groups = $GLOBALS['TYPO3_DB']->exec_SELECTgetRows('DISTINCT set_id','tx_crawler_queue',$realWhere);
        foreach($groups as $group) {
            tx_crawler_domain_events_dispatcher::getInstance()->post('queueEntryFlush',$group['set_id'], $GLOBALS['TYPO3_DB']->exec_SELECTgetRows('uid, set_id','tx_crawler_queue',$realWhere.' AND set_id="'.$group['set_id'].'"'));
        }
    }

Imagine what happens if the observer does not need any rows but exec_SELECTgetRows() returns 90000 rows. Would that be efficient? Counting the raw data alone: 32 bits = 4 bytes * 90000 = 360K of data. Add PHP's internal array structures on top and you get an idea of how much memory this code needs. It is well beyond a typical 128K value that most servers have for TYPO3.

It would be much better to supply a database resource instead of the data. The function can then do whatever it wants with the resource (seek, fetch, etc.). Even better would be to supply information about the SQL query rather than the data itself.

Migrated from: https://forge.typo3.org/issues/8084
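
The suggestion above can be illustrated with a PHP generator, which hands rows to an observer one at a time instead of building the full array first. A minimal sketch for current PHP versions, assuming a hypothetical `lazyQueueRows()` in place of a real database cursor; the two observers are made up for the demonstration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical row source standing in for a database result cursor.
 * Rows are produced one at a time; $fetched counts how many rows were
 * actually materialised.
 */
function lazyQueueRows(int $total, int $setId, int &$fetched): Generator
{
    for ($uid = 1; $uid <= $total; $uid++) {
        $fetched++;
        yield ['uid' => $uid, 'set_id' => $setId];
    }
}

// An observer that only needs the set id never touches the rows.
$ignoresRows = function (int $setId, iterable $rows): int {
    return $setId;
};

// An observer that needs the rows streams them without building an array.
$countsRows = function (int $setId, iterable $rows): int {
    $count = 0;
    foreach ($rows as $row) {
        $count++;
    }
    return $count;
};

$fetched = 0;
$ignoresRows(7, lazyQueueRows(90000, 7, $fetched));
$fetchedAfterIgnore = $fetched; // still 0: the generator body never ran

$counted = $countsRows(7, lazyQueueRows(90000, 7, $fetched));
$fetchedAfterCount = $fetched;  // 90000: rows were produced on demand

echo $fetchedAfterIgnore, ' ', $counted, ' ', $fetchedAfterCount, PHP_EOL;
```

The point of the design is that the cost of fetching is only paid by observers that iterate; an observer that ignores the rows costs nothing.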

Crawling Processes Page Information - Refresh button does not work as expected in TYPO3 6.2

This is related to the change made in relation to this issue #37 (52c7b2d).

The commit introduces a regression.

Steps to reproduce in crawler version 5.0.1, TYPO3 6.2 or 7.6.2:

  1. Add pages to the crawler queue (5-10 pages)
  2. Manually start crawling processes (Add process)
  3. Click reload (maybe multiple times).
  4. When the processes are finished and exit, you will start seeing the error (because instead of refreshing you keep adding processes):

[Screenshot, 2016-01-20 16:54:42]

It seems that the action which is executed on "Refresh" is still "Add process" or most probably the last executed task.

Links on refresh icon/queue id do not work in TYPO3 7.6

The links on the refresh icon/queue id in the crawler log list do not contain all needed parameters:
index.php?id=1&qid_read=1&setID=0
should be:
/typo3/index.php?M=web_info&moduleToken={token}&qid_read=1&setID=0&id=1

I hope I'm in the right place here on GitHub. I'll submit a pull request with a fix. If you prefer, I can submit the patch to Gerrit instead; just let me know.
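
As an illustration of the expected link format, the query string can be assembled from the module key, the token and the crawler parameters. This is a sketch only: `buildModuleLink()` is a hypothetical helper and `TOKEN` is a placeholder. In TYPO3 itself the token is generated by the core, presumably via BackendUtility::getModuleUrl(), which an earlier issue on this page mentions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: assembles a backend module link that contains the
 * module key and token in addition to the crawler parameters.
 */
function buildModuleLink(string $module, string $token, array $params): string
{
    $query = array_merge(['M' => $module, 'moduleToken' => $token], $params);
    // Pass '&' explicitly so the result does not depend on php.ini settings.
    return '/typo3/index.php?' . http_build_query($query, '', '&');
}

echo buildModuleLink('web_info', 'TOKEN', ['qid_read' => 1, 'setID' => 0, 'id' => 1]), PHP_EOL;
// /typo3/index.php?M=web_info&moduleToken=TOKEN&qid_read=1&setID=0&id=1
```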

Core Bug #70052 isn't as resolved as you might think

See my comment on this commit: AOEpeople@9d9fe40

This 'bug' hasn't been resolved: the EXT: display condition has just been marked as deprecated, and then they called it a day. So in 7.6.4 you still get an uncaught TYPO3 exception in the backend when the column sys_worspace_uid is defined and EXT:versions is not installed.

I supplied a pull request reverting this commit.

Remove "Add Process"-button when no more queue-item can be assigned

UX Fix:

When Pending Entries (assigned / overall) shows equal values, e.g. 30 / 30, the "add process" button should not be visible, as clicking it returns an error that the process cannot be added.

The button does not need to be visible in that case: clicking it makes no sense, as no entries can be assigned anyway.
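
The condition described above boils down to a single comparison. A minimal sketch, where `canAddProcess()` is a hypothetical name and the two counters correspond to the assigned / overall numbers shown in the backend:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical check: a new process is only useful while some pending
 * queue entries are not yet assigned to a process.
 */
function canAddProcess(int $assigned, int $overall): bool
{
    return $assigned < $overall;
}

var_dump(canAddProcess(30, 30)); // bool(false): hide the button
var_dump(canAddProcess(12, 30)); // bool(true): show the button
```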

[Screenshots: 2016-01-27_crawler, 2016-01-27_crawler_error]

FEUser GroupList not resolved recursively when crawling

Unfortunately, the list of groups an FE user is a member of is not resolved recursively.

It works with one level of subgroups, but if a subgroup has subgroups of its own (and so on), different grlist entries are written to the indexed_search pHash table (see screenshot: the first entry was created by manually opening the page in the browser, the second by the crawler extension with the same FE group entry).

Maybe setting the group list in tx_crawler_hooks_tsfe->fe_feuserInit is not enough.
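
Resolving groups at any depth can be sketched as a graph traversal. This is an illustration under assumptions, not the extension's code: `resolveGroupsRecursively()` is hypothetical, and `$subgroupMap` stands in for the subgroup relations that would come from the FE group records.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical resolver: collects the given group ids plus all of their
 * subgroups at any depth. $subgroupMap maps a group id to its direct
 * subgroup ids; already visited groups are skipped, so cycles terminate.
 *
 * @param int[] $groupIds
 * @param array<int, int[]> $subgroupMap
 * @return int[] sorted list of unique group ids
 */
function resolveGroupsRecursively(array $groupIds, array $subgroupMap): array
{
    $resolved = [];
    $pending = $groupIds;
    while ($pending !== []) {
        $id = array_pop($pending);
        if (isset($resolved[$id])) {
            continue;
        }
        $resolved[$id] = true;
        foreach ($subgroupMap[$id] ?? [] as $subgroupId) {
            $pending[] = $subgroupId;
        }
    }
    $ids = array_keys($resolved);
    sort($ids);
    return $ids;
}

// Group 1 contains 2, which contains 3, which points back to 1.
$map = [1 => [2], 2 => [3], 3 => [1]];
echo implode(',', resolveGroupsRecursively([1], $map)), PHP_EOL; // 1,2,3
```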
