tomasnorre / crawler

Libraries and scripts for crawling the TYPO3 page tree. Used for re-caching, re-indexing, publishing applications etc.

License: GNU General Public License v3.0

PHP 96.40% HTML 3.54% Shell 0.04% CSS 0.01%

typo3 extension crawler cache-warmup php indexer hacktoberfest

crawler's Introduction

TYPO3 Crawler

You can include the crawler in your TYPO3 project with Composer or from TER:

composer require tomasnorre/crawler

Versions and Support

Release	TYPO3	PHP	Fixes will contain
12.x.y	12.4	8.1-8.3	Features, Bugfixes, Security Updates
11.x.y	10.4-11.5	7.4-8.1	Bugfixes, Security Updates, Since 11.0.3 PHP 8.1
10.x.y	9.5-11.0	7.2-7.4	Security Updates
9.x.y	9.5-11.0	7.2-7.4	As this version has same requirements as 10.x.y, there will be no further releases of this version, please update instead.
8.x.y			Releases do not exist
7.x.y			Releases do not exist
6.x.y	7.6-8.7	5.6-7.3	Security Updates

Documentation

Please read the documentation

To render the documentation locally, please use the official TYPO3 Documentation rendering Docker Tool. https://github.com/t3docs/docker-render-documentation

Contributions

Please see CONTRIBUTING.md

Honorable Previous Maintainers

Kasper Skaarhoj
Daniel Poetzinger
Fabrizio Branca
Tolleiv Nietsch
Timo Schmidt
Michael Klapper
Stefan Rotsch

crawler's People

eyecatchup lfd4 christianfutterlieb macopedia bernhardberger lonson megamisan motordigital kmcs cornerfarmer annabarbosa mogic-le revoltek-daniel dwinkelbauer felixsemmler cdro r3h6 nicodh stuartmcfarlane zeroseven tbal kklefenz dacyberpunk negrul kaystrobach hemmerch webit-de jpmschuler tintusumith chesio tstahn bmack stefa50 cosmocode georgringer fabtho vzz3 fl3s infabo rengaw83 merzilla ghanshyambhava ichhabrecht jpgreth tomasvotruba stucki werkraum possi brotkrueml carstendietrich rintisch peterkraume landerop gelbehexe siggwer localheinz kraemer-igroup samuelliu robertdigital aoepeople patta lochmueller kpnielsen kanti lukedlbrg schoeppe mbrodala maikschneider communiacs koehnlein hawkeye1909 nhovratov indyindyindy n3amil metapublic-gbr t3easy olilo

crawler's Issues

Regression - Crawler cannot create new processes

It seems that the commit 656cd90 introduced a regression. Occurs in TYPO3 6 & 7 LTS.

Steps to reproduce:

Add pages to the crawler queue
In BE, go to the Process Overview and try to start a new process
The process will not start and an error will appear.

Add pages to the crawler queue
In the Scheduler, create and start the Crawler Run task.
Two PHP warnings will occur but no entries from the queue will be executed.

Include additional linting from tomasnorre/t3ee_example

No php binary found in '/usr/bin/php/' -> Linuxserver

I have an error with the site crawler 3.5.0 in TYPO3 4.5.10:

No php binary found in '/usr/bin/php/'. Please update value for 'phpPath' in crawler extension setup.

The php binary is there.

Downgrade to 3.2.0 fixed the problem.

Migrated from: https://forge.typo3.org/issues/47100

TYPO3 4.5 scheduler namespace fatal error

The extension is defined as compatible with TYPO3 4.5, but if I try to open the scheduler I get

PHP Fatal error: Class 'TYPO3\CMS\Scheduler\Task\AbstractTask' not found in /.../typo3conf/ext/crawler/scheduler/class.tx_crawler_scheduler_im.php on line 32

because you used the TYPO3 6.2+ namespace classes.

Environment TYPO3 4.5.41
PHP 5.5.9

[BE Module] SelectBox - Menu call to 'index.php?id='

I'm trying to reindex my website after the upgrade from 4.5 to 6.2. My current TYPO3 version is 6.2.6, with crawler 3.6.2.

My problem: when I select an option in the Site Crawler menu other than the one already selected, I am redirected to the backend.php section of TYPO3. I don't know why other people don't have the same problem.

I had the same problem with other extensions and solved it. I found the solution in this article:

http://docs.typo3.org/typo3cms/InsideTypo3Reference/CoreArchitecture/BackendModules/BackendModulesUsingTypo3Modphp/Index.html

but I had to use the function \TYPO3\CMS\Backend\Utility\BackendUtility::getModuleUrl().

If I'm not mistaken, the URL call in this context is made in the main function of the class "tx_crawler_contextMenu", and the HTML code is loaded in the module (modfunc1/class.tx_crawler_modfunc1.php) by the function selectorBox. When I inspect the code of the select box in the browser, I get this:

Start Crawling Crawler log Crawling Processes

Shouldn't jumpToUrl be called with mod.php?M=crawler&moduleKey=xxxxx&id=xxx... ?

Does anybody know what's going on?

Thanks

Migrated from: https://forge.typo3.org/issues/63772

Check addNotice function

This function looks like it's not used/needed anymore. Please check.

Warning generated in IconUtility

I've set up crawler for indexing through the CLI and am now getting the following warning whenever it runs:

PHP Warning: in_array() expects parameter 2 to be array, null given in /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Utility/IconUtility.php on line 594
PHP Stack trace:
PHP 1. {main}() /var/www/typo3/typo3_src-6.1.1/typo3/cli_dispatch.phpsh:0
PHP 2. include() /var/www/typo3/typo3_src-6.1.1/typo3/cli_dispatch.phpsh:65
PHP 3. tx_crawler_lib->CLI_main_im() /var/www/website/typo3/typo3conf/ext/crawler/cli/crawler_im.php:8
PHP 4. tx_crawler_lib->getPageTreeAndUrls() /var/www/website/typo3/typo3conf/ext/crawler/class.tx_crawler_lib.php:1941
PHP 5. TYPO3\CMS\Backend\Tree\View\AbstractTreeView->getTree() /var/www/website/typo3/typo3conf/ext/crawler/class.tx_crawler_lib.php:1583
PHP 6. TYPO3\CMS\Backend\Tree\View\AbstractTreeView->getTree() /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Tree/View/AbstractTreeView.php:794
PHP 7. TYPO3\CMS\Backend\Tree\View\AbstractTreeView->getIcon() /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Tree/View/AbstractTreeView.php:808
PHP 8. TYPO3\CMS\Backend\Utility\IconUtility::getSpriteIconForRecord() /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Tree/View/AbstractTreeView.php:681
PHP 9. TYPO3\CMS\Backend\Utility\IconUtility::getSpriteIcon() /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Utility/IconUtility.php:693
PHP 10. in_array() /var/www/typo3/typo3_src-6.1.1/typo3/sysext/backend/Classes/Utility/IconUtility.php:594
It doesn't seem to impact the indexing setup, but it leaves a nasty message in the logs :P

IMPORTANT: Please see comment/discussion here: https://forge.typo3.org/issues/50558

Setting up Travis-CI and Scrutinizer

This is practically finished, it might have small adjustments, but in general it should be ready:

Travis: https://travis-ci.org/tomasnorre/crawler/builds/73626092
Branch: https://github.com/tomasnorre/crawler/tree/Feature/Travis-Scrutinizer
Commit: 9646a7f

Migrated from: https://forge.typo3.org/issues/68951

Crawling Processes page information leads to a broken URL when you click any of the buttons there (TYPO3 7.6.0)

When you go to the backend Info module, select the Site Crawler page, and then choose "Crawling Processes" from the page information select box, you get three buttons:
"Refresh", "Enable crawling" and "Show finished and terminated processes".
If you click on any of them, the right-hand TYPO3 backend iframe loads the index of the TYPO3 backend.
Look at the screenshots:
1 screen: the exact page where the buttons are
2 screen: the result when you click on any of them

Deleting old tx_crawler_processes entries ends up in slow query

This update query is very slow. Is there any way to improve it?
Using UPDATE to set the process to deleted=1 also leaves the table full of old deleted entries.

    // Soft-delete inactive processes that have no pending queue entries left.
    $GLOBALS['TYPO3_DB']->exec_UPDATEquery(
        'tx_crawler_process',
        'active=0
        AND NOT EXISTS (
            SELECT * FROM tx_crawler_queue
            WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
            AND tx_crawler_queue.exec_time = 0
        )',
        array(
            'deleted' => '1'
        )
    );

We get this all the time:

Query | 1033 | Sending data | UPDATE tx_crawler_process SET deleted = '1' WHERE active =0 AND NOT EXISTS (SEL

Is there a possibility to use DELETE instead of UPDATE, so the table doesn't fill up that much?

IMPORTANT: Please see discussion/questions at: https://forge.typo3.org/issues/6792; future discussion will continue here.

Incompatible with Indexed_search

From the last release of 4.0.0, we got this error reported:

https://forge.typo3.org/issues/55106#note-17

Hi all,
I've installed and tried today the crawler latest dev version https://git.typo3.org/TYPO3CMS/Extensions/crawler.git/snapshot/20bfb07561b5e15f2324fd84751c633cb049dacd.zip on Typo3 6.2.7, PHP 5.5.9.
Unfortunately it doesn't seem to want to queue anything.

This is my process (mostly following the howto: http://xavier.perseguers.ch/tutoriels/typo3/articles/indexed-search-crawler.html):
add a crawler record at the page tree root named simple-indexing
truncate all concerned MySQL tables to have something clean
go to a given normal page with Web > Info and have a look at Indexed Search (OK, no entries)
stay in Web > Info, but now choose Site crawler
choose simple-indexing, click on Update and then on Crawl URLs
Here: Update refreshes the page but Count stays at zero.
Crawl URLs leads to another page and we can see "0 URL submitted".
I ran the same process this morning with the 3.5.0 crawler extension from the TER: the Count goes to 1 and I got "1 URL submitted", but starting the job manually gave no indexing results. That is how I found the current bug report here, and why I would like to find a crawler that runs with my TYPO3 version.

Thanks for your help.

Migrated from: https://forge.typo3.org/issues/68949

HTTP Authentication support

It would be very nice to have HTTP authentication support in the crawler, so that websites (mainly intranets) protected by htaccess authentication could be crawled too, especially when it's not possible to exclude the server itself from the authentication procedure.
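As a rough illustration of what such support would involve on the request side, the sketch below adds an HTTP Basic `Authorization` header when credentials are configured. The function name, its parameters, and the idea of storing credentials per crawler configuration are assumptions made for this example; they are not part of the extension.

```php
<?php
/**
 * Hypothetical helper (not part of the crawler extension): builds the
 * headers for a crawl request and adds HTTP Basic credentials when both
 * a user name and a password are configured.
 *
 * @return string[] raw header lines
 */
function buildCrawlRequestHeaders(string $host, string $user = '', string $password = ''): array
{
    $headers = ['Host: ' . $host, 'Connection: close'];
    if ($user !== '' && $password !== '') {
        // HTTP Basic scheme: "Basic " followed by base64("user:password").
        $headers[] = 'Authorization: Basic ' . base64_encode($user . ':' . $password);
    }
    return $headers;
}
```

Without credentials the header list is unchanged, so unprotected sites would be crawled exactly as before.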

Migrated from: https://forge.typo3.org/issues/47595

"crawl" context menu doesn't work - fatal error on TYPO3 6.2.19

Steps:

Go to a page where you have a crawler configuration record
Open the context (right-click) menu on the crawler record
Choose "crawler" from the menu

Result:
Fatal error: Class 'TYPO3\CMS\Core\Utility\GeneralUtility' not found in typo3/sysext/info/mod1/index.php on line 21

Replace deprecated parameter in ExtensionManagementUtility::insertModuleFunction call

Re-add type hint to tx_crawler_hooks_staticFileCacheCreateUri::initialize

The type hint for $parent has been removed as a bugfix with #66545. With switching to namespaces, we should add it again:

public function initialize(array $parameters, SFC\NcStaticfilecache\StaticFileCache $parent) {

Migrated from: https://forge.typo3.org/issues/68727

Add scheduler task for process cleanup hook

TCA field 'base_url' displayCond is broken

The display condition of the 'base_url' column in TCA is broken. This may not seem like a big deal, since you could potentially use the base URL from the domain record, but it matters when you're trying to crawl via https://, as 'base_url' is the only way to achieve this.

A pull request will follow shortly.

Hidden records should be ignored

I have a page where I display data records that are organized in a sysfolder. I'm using the following configuration to crawl the different contents (page uid is 13, sysfolder uid is 15):

tx_crawler.crawlerCfg.paramSets {
    items = &tx_myext[items]=[_TABLE:tt_myext_items;_PID:15]
    items {
        pidsOnly = 13
        cHash = 1
        procInstrFilter = tx_indexedsearch_reindex
    }
}

The problem: hidden records in the sysfolder are still added to the queue and later indexed as empty pages, because the hidden content cannot be shown. This also produces entries in the search results, which lead users to an empty page.

Expected behavior: Hidden records should be ignored when building the initial crawler queue.
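The expected behavior can be sketched as a filter applied to the records before their uids are expanded into queue URLs. The function name and the row layout (`uid`, `hidden`, `deleted` keys) are assumptions for illustration; the real extension reads the records from the database, where the same condition would more naturally go into the query.

```php
<?php
/**
 * Hypothetical sketch: keeps only the uids of records that are neither
 * hidden nor deleted, so that only visible records end up in the queue.
 *
 * @param array[] $records rows with 'uid' and optional 'hidden'/'deleted' flags
 * @return int[] uids of visible records
 */
function filterVisibleRecordUids(array $records): array
{
    $uids = [];
    foreach ($records as $record) {
        // empty() treats a missing flag, 0 and '0' alike as "not set".
        if (empty($record['hidden']) && empty($record['deleted'])) {
            $uids[] = (int)$record['uid'];
        }
    }
    return $uids;
}
```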

It seems that a similar problem is described in #7455, but that bug is closed.

IMPORTANT: See comments/discussion here: https://forge.typo3.org/issues/43655

Fix CLI: incorrect start index / high memory consumption.

Two related issues were posted to the forge bug tracker:

Currently, in CLI mode the startpage argument is read with the wrong index. As a result, the crawler always starts crawling from id 0.
Currently, the crawler consumes a lot of memory and runs into timeouts even in CLI context.

A patch to address these issues was committed as well. It uses the passed page id instead of 0 as the start page. This patch should be merged into the GitHub repo as well.

It's not possible to mass-edit "Base url" field in the crawler configuration

Steps to reproduce:

Go to the list view
Click on the "Crawler Configuration" table header
Add "Base url" column to the table

Actual: base url column is empty
Expected: base url column is filled with data

Click on the pencil icon to mass edit

Actual: base url selectors will not appear
Expected: base url selectors will be visible

TYPO3 6.2.19

Convert to PSR-2

Migrated from: https://forge.typo3.org/issues/68499

Refactor cli_* classes to one class with several options

Splitting Extension configuration into tabs

To have a better overview of the configurations options it would be nice to split the extension configuration into tabs.

When this is done, please update the documentation with new screenshots.

Migrated from: https://forge.typo3.org/issues/69003

Update Documentation

Proof read the documentation, and update accordingly.

Add the AOE Way
See text here: https://docs.typo3.org/typo3cms/extensions/restler/Pages/Aoe.html
Add links section https://docs.typo3.org/typo3cms/extensions/cloudflare/Links.html

AJAX Call in CE Content Menu throws PHP Fatal Error

TYPO3 7.6
crawler 5.0

With the crawler extension installed the CE context menu stops working and throws a PHP Fatal Error each time it is clicked. From the log:

Core: Exception handler (WEB): Uncaught TYPO3 Exception: #1: PHP Catchable Fatal Error: Argument 1 passed to tx_crawler_contextMenu::main() must be an instance of clickMenu, instance of TYPO3\CMS\Backend\ClickMenu\ClickMenu given, called in /var/www/staging.example.org/typo3_src-7.6.0/typo3/sysext/backend/Classes/ClickMenu/ClickMenu.php on line 488 and defined in /var/www/staging.example.org/public/typo3conf/ext/crawler/class.tx_crawler_contextMenu.php line 43 | TYPO3\CMS\Core\Error\Exception thrown in file /var/www/staging.example.org/typo3_src-7.6.0/typo3/sysext/core/Classes/Error/ErrorHandler.php in line 111. Requested URL: http://staging.example.org/typo3/index.php?ajaxID=%2Fajax%2Fcontext-menu&ajaxToken=a3774c8d3796a751e31882c20de9e5c4c582a53b&table=tt_content&uid=227&listFr=1

Hope that helps!

MultiCrawler Output

Yesterday I tested and accepted #69331, but I'm not sure it's the right approach.

The settings processDebug and processVerbose are two different things, so I'm considering changing $this->verbose to be driven by processVerbose, a new setting that would then have to be introduced.

Feedback appreciated.

IMPORTANT: Please see discussion/questions here: https://forge.typo3.org/issues/69341. Future discussion will continue here.

Refresh icon (refresh_n.gif) not shown in TYPO3 7 LTS

The file /typo3/gfx/refresh_n.gif seems no longer to be present in TYPO3 7 LTS.

Something went wrong: process did not appear within 10 seconds.

Hi,

Just upgraded the crawler from 3.2.0 (TYPO3 4.5.22) to the latest 3.5.0 and it seems it does not work anymore.
When trying to add a process the error:

"Something went wrong: process did not appear within 10 seconds."

appears.

I have tried this on my local Windows dev machine and on the Linux server itself, and both seem to fail.

PHP version 5.3.15 on Windows and 5.3.3 on Linux.

IMPORTANT: See discussion/questions at: https://forge.typo3.org/issues/44875

tx_crawler_lib::flushQueue() may unnecessarily eat tons of memory

Here is the relevant code from that function:

    if(tx_crawler_domain_events_dispatcher::getInstance()->hasObserver('queueEntryFlush')) {
        $groups = $GLOBALS['TYPO3_DB']->exec_SELECTgetRows('DISTINCT set_id','tx_crawler_queue',$realWhere);
        foreach($groups as $group) {
            tx_crawler_domain_events_dispatcher::getInstance()->post('queueEntryFlush',$group['set_id'], $GLOBALS['TYPO3_DB']->exec_SELECTgetRows('uid, set_id','tx_crawler_queue',$realWhere.' AND set_id="'.$group['set_id'].'"'));
        }
    }

Imagine what happens if the observer does not need any rows but exec_SELECTgetRows() returns 90000 of them. Would that be efficient? Counting only the raw data: 32 bits = 4 bytes * 90000 = 360K of data. Add PHP's internal array structures on top and you get an idea of how much memory this code needs. It is well beyond a typical 128K value that most servers have for TYPO3.

It would be much better to supply a database resource instead of the data. The function could then do whatever it wants with the resource (seek, fetch, etc.). Even better would be to supply information about the SQL query, not the data itself.
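One way to picture the "resource instead of data" idea in plain PHP is a generator that hands rows to the observer one at a time. This is only a sketch: `iterateQueueRows()` and `countFlushedEntries()` are hypothetical names, and `$fetchRow` stands in for whatever fetch call the database layer offers.

```php
<?php
/**
 * Hypothetical lazy row source: yields one queue row at a time instead of
 * returning the whole result set as an array.
 * $fetchRow stands in for a database fetch call; it returns the next row
 * or null when the result set is exhausted.
 */
function iterateQueueRows(callable $fetchRow): Generator
{
    while (($row = $fetchRow()) !== null) {
        yield $row;
    }
}

/**
 * Example observer that only needs a count. It consumes the rows lazily,
 * so no more than one row is held by this code at any time.
 */
function countFlushedEntries(iterable $rows): int
{
    $count = 0;
    foreach ($rows as $unusedRow) {
        $count++;
    }
    return $count;
}
```

An observer that needs nothing could simply not iterate at all, in which case no rows would be fetched.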

Migrated from: https://forge.typo3.org/issues/8084

config.absRefPrefix not respected by tx_crawler_lib::getFrontendBasePath()

GIVEN the frontendBasePath extension setting is empty
AND config.absRefPrefix is set
THEN frontendBasePath should use the value from config.absRefPrefix

It seems $GLOBALS['TSFE'] is empty.
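The requested fallback order can be sketched as a small pure function. The name `resolveFrontendBasePath()`, the final default of "/" and the slash normalisation are assumptions for this example, and it only handles path-style prefixes (not values such as a full URL).

```php
<?php
/**
 * Hypothetical resolver: prefers the frontendBasePath extension setting,
 * falls back to config.absRefPrefix, and finally to "/".
 * The result always has exactly one leading and one trailing slash.
 */
function resolveFrontendBasePath(string $extensionSetting, ?string $absRefPrefix): string
{
    $basePath = trim($extensionSetting);
    if ($basePath === '' && $absRefPrefix !== null) {
        // Extension setting is empty: use config.absRefPrefix if it is set.
        $basePath = trim($absRefPrefix);
    }
    $segments = trim($basePath, '/');
    return $segments === '' ? '/' : '/' . $segments . '/';
}
```

Passing the absRefPrefix value in as an argument keeps the function independent of `$GLOBALS['TSFE']`, which the report suspects is empty at that point.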

Migrated from: https://forge.typo3.org/issues/68950

Removed Commented lines with require

Crawling Processes Page Information - Refresh button does not work as expected in TYPO3 6.2

This is related to the change made in relation to this issue #37 (52c7b2d).

The commit introduces a regression.

Steps to reproduce in crawler version 5.0.1, TYPO3 6.2 or 7.6.2:

Add pages to the crawler queue (5-10 pages)
Manually start crawling processes (Add process)
Click reload (maybe multiple times).
When the processes are finished and exit, you will start seeing the error (because instead of refreshing you keep adding processes):

It seems that the action which is executed on "Refresh" is still "Add process" or most probably the last executed task.

Add code-coverage upload to travis

Add Travis Config for both TYPO3 6 and 7LTS

reduce CLI memory consumption

Currently the crawler eats lots of memory and is running in timeouts even in CLI context

IMPORTANT:
See https://forge.typo3.org/issues/64491 for more info; future discussion etc. will take place here on GitHub.

Remove assignment of queue items to process if process gets killed by cleanup hook

Remove deprecated call of GeneralUtility::loadTCA()

Related: #11

Split functional and unit testing

Split testing into Functional and Unit, and add the Functional tests to the Travis CI run.

Related to #13

Removed Version comparisons for 4.x

Update BE Icon to work with TYPO3 v7.x

There was a similar update in ext:phpunit, inspiration can be found here:
https://forge.typo3.org/issues/64645

Migrated from: https://forge.typo3.org/issues/68993

Missing "hide"/"disable" field in the configuration record form

The crawler configuration record form is missing the "hide" checkbox. The only way to disable the record is to do it from the list module.

Links on refresh icon/queue id do not work in TYPO3 7.6

The links on the refresh icon/queue id in the crawler log list do not contain all needed parameters:
index.php?id=1&qid_read=1&setID=0
should be:
/typo3/index.php?M=web_info&moduleToken={token}&qid_read=1&setID=0&id=1
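To make the difference concrete, the sketch below assembles the expected link from its parts. `buildCrawlerLogLink()` is a hypothetical helper written for this example; inside TYPO3 the module token would come from the backend, not from a plain argument.

```php
<?php
/**
 * Hypothetical helper: builds a backend link that carries the module key
 * and module token in addition to the crawler-specific parameters.
 */
function buildCrawlerLogLink(string $moduleToken, int $pageId, int $queueId, int $setId): string
{
    $parameters = [
        'M' => 'web_info',
        'moduleToken' => $moduleToken,
        'qid_read' => $queueId,
        'setID' => $setId,
        'id' => $pageId,
    ];
    // Pass the separator explicitly so the result does not depend on php.ini.
    return '/typo3/index.php?' . http_build_query($parameters, '', '&');
}
```

Called with a token, page id 1, queue id 1 and set id 0, this yields the same parameter order as the expected URL in the report.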

I hope I'm in the right place here on GitHub; I'll submit a pull request with a fix. If you prefer, I can submit the patch to Gerrit, just let me know.

Namespaces for tests

Migrated from: https://forge.typo3.org/issues/68500

Core Bug #70052 isn't as resolved as you might think

See my comment on this commit: AOEpeople@9d9fe40

This 'bug' hasn't been resolved; the EXT: display condition has just been marked as deprecated and then they called it a day. So in 7.6.4 you still get an uncaught TYPO3 exception in the backend when the column sys_workspace_uid is defined and EXT:versions is not installed.

I supplied a pull request reverting this commit.

Replace deprecated function calls

Migrated from: https://forge.typo3.org/issues/68502

Namespaces for classes

Migrated from: https://forge.typo3.org/issues/68501

Not working with TYPO3 7.6 / usage of invalid functions

The crawler ext. uses function \TYPO3\CMS\Core\Utility\GeneralUtility::loadTCA but this function is not available in TYPO3 version 7.6

Enable Workspace

The Workspace is currently disabled, due to a bug in the core.

The bug is solved in master.
https://forge.typo3.org/issues/70052

Remove "Add Process"-button when no more queue-item can be assigned

UX Fix:

When the Pending Entries (assigned / overall) counts are equal, e.g. 30 / 30, the "Add process" button should not be visible, as clicking it returns an error that the process cannot be added.

There is no need to show it, since it makes no sense to click it when no entries can be assigned anyway.
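The visibility rule amounts to a single comparison, sketched here with a hypothetical function name and with the two counters passed in as plain integers:

```php
<?php
/**
 * Hypothetical check: the "Add process" button is only useful while
 * there are pending entries that are not yet assigned to a process.
 */
function shouldShowAddProcessButton(int $assignedEntries, int $totalEntries): bool
{
    return $assignedEntries < $totalEntries;
}
```

With 30 / 30 the button would be hidden; with 12 / 30 it would stay visible.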

FEUser GroupList not resolved recursively when crawling

Unfortunately, the list of groups an FE user is a member of is not resolved recursively.

It works with one level of subgroups, but if a subgroup has subgroups of its own (and so on), different grlist entries are written to the indexed_search pHash table (see screenshot: the first entry was made by manually opening the page in the browser, the second row by the crawler extension with the same FE group entry).

Maybe setting the group list in tx_crawler_hooks_tsfe->fe_feuserInit is not enough.
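A minimal sketch of resolving subgroups across all nesting levels is shown below. It works on a plain array that maps each group uid to its direct subgroups; that map, like the function name, is an assumption for illustration, since the real data would have to be read from the FE group records.

```php
<?php
/**
 * Hypothetical sketch: collects the given groups plus all of their
 * subgroups across every nesting level.
 *
 * @param int[] $groupIds groups assigned directly to the user
 * @param array<int, int[]> $subgroups map of group uid => direct subgroup uids
 * @return int[] sorted list of unique group uids
 */
function resolveGroupsTransitively(array $groupIds, array $subgroups): array
{
    $resolved = [];
    $stack = $groupIds;
    while ($stack !== []) {
        $groupId = array_pop($stack);
        if (isset($resolved[$groupId])) {
            // Already visited; this also guards against circular group references.
            continue;
        }
        $resolved[$groupId] = true;
        foreach ($subgroups[$groupId] ?? [] as $subgroupId) {
            $stack[] = $subgroupId;
        }
    }
    $result = array_keys($resolved);
    sort($result);
    return $result;
}
```

Sorting the result gives a stable list regardless of traversal order, which matters if the list is later compared with the one produced by a normal frontend request.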