
norconex-committer-plugin's People

Contributors

donghanmiao, joelvoss, srinicodebytes, tanmayvartak, wiarlawd


norconex-committer-plugin's Issues

Getting an error while committing docs to the Google Cloud Search datastore; it started happening all of a sudden. Below is an example of the error.


INFO [HttpCrawler] 2 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] bayer-default: Crawling references...
INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://www.bayer.com.tw/
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://www.bayer.com.tw/zh-hant/
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://www.bayer.com.tw/zh-hant/
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/280x160/public/2020-09/hr.jpg?h=341981b4&itok=cjW2HXv9
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/280x160/public/2020-09/teaser-nav-commit.jpg?h=fd24c189&itok=_2ttr7tf
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/16_9_aspect_ratio/public/2020-11/movingimages05.jpg?h=d19103a9&itok=8abowRlj
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://www.bayer.com.tw/zh-hant/bayer-innovation
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/16_9_aspect_ratio/public/2020-11/movingimages02.jpg?h=bf3ccb75&itok=-iKRgKQx
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://www.bayer.com.tw/zh-hant/bayer-innovation
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/16_9_small/public/2020-11/Receptionist%20talking%20phone_426.jpg?h=656682cd&itok=oVEc4lU6
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/16_9_aspect_ratio/public/2020-11/movingimages01.jpg?h=d19103a9&itok=PoKOHP28
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/16_9_small/public/2020-08/consumer-health.jpg?h=c397aecc&itok=WrOpXWdU
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/280x160/public/2020-08/taiwan.png?h=fd24c189&itok=Wf0Gpu5H
INFO [CrawlerEventManager] URLS_EXTRACTED: https://www.bayer.com.tw/zh-hant/conditions-of-use
INFO [CrawlerEventManager] DOCUMENT_IMPORTED: https://www.bayer.com.tw/zh-hant/conditions-of-use
INFO [CrawlerEventManager] DOCUMENT_IMPORTED: https://www.bayer.com.tw/zh-hant/bayer-innovation
INFO [CrawlerEventManager] DOCUMENT_IMPORTED: https://www.bayer.com.tw/zh-hant/
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: https://www.bayer.com.tw/zh-hant/bayer-innovation (GoogleCloudSearchCommitter[queueSize=100,docCount=62872,queue=FileSystemCommitter[directory=../workdir/queue],commitBatchSize=10,maxRetries=0,maxRetryWait=0,operations=[],targetReferenceField=,sourceReferenceField=,keepSourceReferenceField=false,targetContentField=,sourceContentField=,keepSourceContentField=false])
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: https://www.bayer.com.tw/zh-hant/ (GoogleCloudSearchCommitter[queueSize=100,docCount=62872,queue=FileSystemCommitter[directory=../workdir/queue],commitBatchSize=10,maxRetries=0,maxRetryWait=0,operations=[],targetReferenceField=,sourceReferenceField=,keepSourceReferenceField=false,targetContentField=,sourceContentField=,keepSourceContentField=false])
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: https://www.bayer.com.tw/zh-hant/conditions-of-use (GoogleCloudSearchCommitter[queueSize=100,docCount=62873,queue=FileSystemCommitter[directory=../workdir/queue],commitBatchSize=10,maxRetries=0,maxRetryWait=0,operations=[],targetReferenceField=,sourceReferenceField=,keepSourceReferenceField=false,targetContentField=,sourceContentField=,keepSourceContentField=false])
INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://www.bayer.com.tw/node/
INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://www.bayer.com.tw/rss
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://www.bayer.com.tw/sites/bayer_com_tw/files/bayer-organizational-structure-2020-08-21.pdf
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://www.bayer.com.tw/sites/bayer_com_tw/files/bayer-organizational-structure-2020-08-21.pdf
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://www.bayer.com.tw/themes/custom/bayer_cpa/logo.svg
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://www.bayer.com.tw/themes/custom/bayer_cpa/logo.svg
INFO [CrawlerEventManager] REJECTED_IMPORT: https://www.bayer.com.tw/themes/custom/bayer_cpa/logo.svg
INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://www.bayer.com.tw/en/node/2
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://www.bayer.com.tw/zh-hant/advanced-search
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://www.bayer.com.tw/zh-hant/advanced-search
INFO [CrawlerEventManager] URLS_EXTRACTED: https://www.bayer.com.tw/en/node/56
INFO [CrawlerEventManager] DOCUMENT_IMPORTED: https://www.bayer.com.tw/en/node/556
Dec 09, 2020 10:50:11 PM com.google.enterprise.cloudsearch.sdk.indexing.IndexingServiceImpl getSchema
WARNING: Schema lookup failed. Using empty schema
javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alert.createSSLException(Alert.java:131)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
at sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:654)
at sun.security.ssl.CertificateMessage$T12CertificateConsumer.onCertificate(CertificateMessage.java:473)
at sun.security.ssl.CertificateMessage$T12CertificateConsumer.consume(CertificateMessage.java:369)
at sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:377)
at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:444)
at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:422)
at sun.security.ssl.TransportContext.dispatch(TransportContext.java:182)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:149)
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1143)
at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1054)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:394)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1340)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1315)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:264)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:77)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at com.google.api.client.auth.oauth2.TokenRequest.executeUnparsed(TokenRequest.java:283)
at com.google.api.client.auth.oauth2.TokenRequest.execute(TokenRequest.java:307)
at com.google.api.client.googleapis.auth.oauth2.GoogleCredential.executeRefreshToken(GoogleCredential.java:394)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.api.client.auth.oauth2.Credential.intercept(Credential.java:217)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:868)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:499)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
at com.google.enterprise.cloudsearch.sdk.BaseApiService.executeRequest(BaseApiService.java:429)
at com.google.enterprise.cloudsearch.sdk.indexing.IndexingServiceImpl.getSchema(IndexingServiceImpl.java:1143)
at com.google.enterprise.cloudsearch.sdk.indexing.StructuredData.initFromConfiguration(StructuredData.java:199)
at com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter.init(GoogleCloudSearchCommitter.java:204)
at com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter.commitBatch(GoogleCloudSearchCommitter.java:234)
at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
at com.norconex.committer.core.AbstractBatchCommitter.commitDeletion(AbstractBatchCommitter.java:148)
at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:225)
at com.norconex.committer.core.AbstractCommitter.commitIfReady(AbstractCommitter.java:146)
at com.norconex.committer.core.AbstractCommitter.add(AbstractCommitter.java:97)
at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:34)
at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:27)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.crawler.HttpCrawler.executeCommitterPipeline(HttpCrawler.java:380)
at com.norconex.collector.core.crawler.AbstractCrawler.processImportResponse(AbstractCrawler.java:600)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:541)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:829)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:456)
at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:323)
at sun.security.validator.Validator.validate(Validator.java:271)
at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:315)
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:223)
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:129)
at sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:638)
... 48 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:451)
... 54 more

INFO [GoogleCloudSearchCommitter] Indexing Service reference count: 1
INFO [GoogleCloudSearchCommitter] Sending 10 documents to Google Cloud Search for addition/deletion.
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: https://www.bayer.com.tw/en/node/556 (GoogleCloudSearchCommitter[queueSize=100,docCount=62911,queue=FileSystemCommitter[directory=../workdir/queue],commitBatchSize=10,maxRetries=0,maxRetryWait=0,operations=[],targetReferenceField=,sourceReferenceField=,keepSourceReferenceField=false,targetContentField=,sourceContentField=,keepSourceContentField=false])
INFO [GoogleCloudSearchCommitter] Document deleted (38ms): https://www.cropscience.bayer.ca/en/Products/Fungicides/Prosaro-west/Quality
INFO [GoogleCloudSearchCommitter] Document deleted (0ms): https://www.cropscience.bayer.ca/en/Products/Fungicides/Prosaro-west/Quantity
INFO [GoogleCloudSearchCommitter] Document deleted (0ms): https://www.cropscience.bayer.ca/en/Products/Fungicides/Scala
INFO [GoogleCloudSearchCommitter] Indexing Service release reference count: 1
INFO [GoogleCloudSearchCommitter] Stopping indexingService: 0
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://www.bayer.com.tw/en/node/571
Dec 09, 2020 10:50:11 PM com.google.enterprise.cloudsearch.sdk.BatchRequestService shutDown
INFO: Shutting down batching service. flush on shutdown: true
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://www.bayer.com.tw/en/node/571
INFO [CrawlerEventManager] DOCUMENT_IMPORTED: https://www.bayer.com.tw/en/node/56
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/280x160/public/2020-09/duty_170x100.jpg?h=88f562ca&itok=8eEavwXI
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/2020-08/hospital-science-01.jpg
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/280x160/public/2020-08/taiwan.png?h=fd24c189&itok=Wf0Gpu5H
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/inline-images/hospital-science-02.png
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/280x160/public/2020-11/Newspaper_production.jpg?h=78276bf5&itok=5SH9XPoW
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/280x160/public/2020-09/teaser-nav-commit.jpg?h=fd24c189&itok=_2ttr7tf
INFO [CrawlerEventManager] REJECTED_FILTER: https://www.bayer.com.tw/sites/bayer_com_tw/files/styles/280x160/public/2020-09/teaser-nav-news.jpg?h=a0a0c8ec&itok=_QhViWhe
Dec 09, 2020 10:50:12 PM com.google.enterprise.cloudsearch.sdk.BatchRequestService$SnapshotRunnable getGoogleJsonError
WARNING: Retrying request failed with exception:
javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alert.createSSLException(Alert.java:131)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
at sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:654)
at sun.security.ssl.CertificateMessage$T12CertificateConsumer.onCertificate(CertificateMessage.java:473)
at sun.security.ssl.CertificateMessage$T12CertificateConsumer.consume(CertificateMessage.java:369)
at sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:377)
at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:444)
at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:422)
at sun.security.ssl.TransportContext.dispatch(TransportContext.java:182)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:149)
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1143)
at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1054)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:394)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1340)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1315)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:264)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:77)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at com.google.api.client.auth.oauth2.TokenRequest.executeUnparsed(TokenRequest.java:283)
at com.google.api.client.auth.oauth2.TokenRequest.execute(TokenRequest.java:307)
at com.google.api.client.googleapis.auth.oauth2.GoogleCredential.executeRefreshToken(GoogleCredential.java:394)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.api.client.auth.oauth2.Credential.intercept(Credential.java:217)
at com.google.api.client.googleapis.batch.BatchRequest$BatchInterceptor.intercept(BatchRequest.java:300)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:868)
at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
at com.google.enterprise.cloudsearch.sdk.BatchRequestService$BatchRequestHelper.executeBatchRequest(BatchRequestService.java:447)
at com.google.enterprise.cloudsearch.sdk.BatchRequestService$SnapshotRunnable.execute(BatchRequestService.java:308)
at com.google.enterprise.cloudsearch.sdk.BatchRequestService$SnapshotRunnable.run(BatchRequestService.java:238)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:456)
at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:323)
at sun.security.validator.Validator.validate(Validator.java:271)
at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:315)
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:223)
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:129)
at sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:638)
... 31 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:451)
INFO [GoogleCloudSearchCommitter] Indexing Service release reference count: 1
INFO [GoogleCloudSearchCommitter] Stopping indexingService: 0
Dec 09, 2020 10:53:58 PM com.google.enterprise.cloudsearch.sdk.BatchRequestService shutDown
INFO: Shutting down batching service. flush on shutdown: true
INFO [GoogleCloudSearchCommitter] Shutting down (took: 2ms)!
INFO [GoogleCloudSearchCommitter] Indexing Service reference count: 0
INFO [AbstractCrawler] bayer-default: Crawler executed in 8 minutes 2 seconds.
INFO [SitemapStore] bayer-default: Closing sitemap store...
ERROR [JobSuite] Execution failed for job: bayer-default
INFO [JobSuite] Running bayer-default: END (Wed Dec 09 22:45:55 UTC 2020

I have checked the JRE keystore; no certificate has expired recently. I also updated my SDK to the latest version, but nothing worked. I am getting this error regardless of which domain I try to index.
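A PKIX path failure at token refresh usually means the JVM truststore does not contain a CA in the certificate chain the crawler actually sees (commonly a TLS-inspecting corporate proxy rather than an expired certificate). A sketch of how one might capture and import that chain, assuming a JDK on the PATH, write access to $JAVA_HOME, and the default `changeit` store password:

```shell
#!/bin/bash
# Sketch for the PKIX failure above. The JVM truststore is likely missing
# a CA in the chain the crawler actually sees (often a TLS-inspecting
# corporate proxy rather than an expired certificate).

# Extract the first PEM certificate block from `openssl s_client -showcerts`
# output. For a proxy you would typically import the issuing CA, which
# appears further down the chain, not the leaf.
first_cert() {
    sed -n '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p;/-END CERTIFICATE-/q'
}

# Illustrative usage (requires network access and a JDK; the alias and
# the default 'changeit' password are assumptions):
#   openssl s_client -connect cloudsearch.googleapis.com:443 -showcerts \
#       </dev/null | first_cert > seen-ca.pem
#   keytool -importcert -trustcacerts -noprompt -alias seen-ca \
#       -file seen-ca.pem \
#       -keystore "$JAVA_HOME/lib/security/cacerts" -storepass changeit
```

Comparing what `openssl s_client` returns with and without the proxy in the path would also tell you whether the chain is being rewritten.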

Handling API Error Message: Stale version number specified.

When committing items to Google Cloud Search, the following API error randomly shows up:

WARNING: Request failed with error
{
    "code": 400,
    "errors": [
        {
            "domain": "global",
            "message": "Stale version number specified. ",
            "reason": "failedPrecondition"
        }
    ],
    "message": "Stale version number specified. ",
    "status": "FAILED_PRECONDITION"
}

Is there a workaround to track down the issue, such as logging the ID of the failing item?

Thanks.

Peter Chan

Preparing for Committer Core V3 release

Hello @donghanmiao, this is Pascal from Norconex. This is to inform you we are working on the next major release of our crawler stack and that includes a major release of Committer Core. There are snapshot versions available of Committer Core V3, in addition to the code already in the project master branch. It adds new features such as retrying or splitting failing batches, new life-cycle events to init/clean properly, and more.

Being a major version, it includes breaking API changes. I am writing to learn your approach to keeping up with major releases: would you appreciate a pull request from us, or would you prefer to have a go at it yourself?

Norconex Collector or Cloud Search Committer process not exiting

Hi

We have observed a periodic problem where the Norconex Collector process does not exit even though the log indicates the crawl completed. We are running committer plugin v1-0.0.5 with Norconex Collector v2.9.1 in a Docker container, based on instructions found here. In the start script we echo a success or error message when the collector process returns, but occasionally execution never reaches that point. Please see the attached log entries below.

A 2020-07-11T16:53:11.095961101Z INFO  [GoogleCloudSearchCommitter] Document (text/html) indexed (213 KB / 1288ms): https://a.domain.com/index.html
A 2020-07-11T16:53:11.096007563Z INFO  [GoogleCloudSearchCommitter] Indexing Service release reference count: 1
A 2020-07-11T16:53:11.096015720Z INFO  [GoogleCloudSearchCommitter] Stopping indexingService: 0 
A 2020-07-11T16:53:11.097438429Z Jul 11, 2020 4:53:11 PM com.google.enterprise.cloudsearch.sdk.BatchRequestService shutDown 
A 2020-07-11T16:53:11.097483701Z INFO: Shutting down batching service. flush on shutdown: true 
A 2020-07-11T16:53:14.808163240Z INFO  [GoogleCloudSearchCommitter] Shutting down (took: 3712ms)! 
A 2020-07-11T16:53:14.808202056Z INFO  [GoogleCloudSearchCommitter] Indexing Service reference count: 0 
A 2020-07-11T16:53:14.946960495Z INFO  [AbstractCrawler] Crawler A: 4310 reference(s) processed. 
A 2020-07-11T16:53:14.947056707Z INFO  [CrawlerEventManager]          CRAWLER_FINISHED 
A 2020-07-11T16:53:14.947143098Z INFO  [AbstractCrawler] Crawler A: Crawler completed. 
A 2020-07-11T16:53:14.948546999Z INFO  [AbstractCrawler] Crawler A: Crawler executed in 5 hours 1 minute 18 seconds. 
A 2020-07-11T16:53:14.948568748Z INFO  [SitemapStore] Crawler A: Closing sitemap store... 
A 2020-07-11T16:53:14.953454126Z INFO  [JobSuite] Running Crawler A: END (Sat Jul 11 11:51:56 UTC 2020) 
<!-- occasionally the log entries stop at above line, and VM process will stall and consume minimal CPU resource. We expect the log entries below to display after index items are committed and crawl completes successfully -->
2020-07-15Txx:xx:xxZ INFO  [JobSuite] Running Cloud Search HTTP Collector: END 
2020-07-15Txx:xx:xxZ crawl process exited successfully   <--- echoed from the start.sh after command 'collector-http.sh -a start -c ...', but does not always reach this point

start.sh

#!/bin/bash
#set -x
#set -e

"${CRAWLER_HOME}/collector-http.sh" -a start -c "${CRAWLER_HOME}/config/crawler-config.xml"
if [ $? -eq 0 ]
then
	echo "$(date) crawl process exited successfully";
else
	echo "$(date) - Error occurred running crawler.";
fi

Is there a good explanation for why this happens? Could it be an issue with the committer, or does the log indicate an issue with the collector? I am looking for ideas to troubleshoot this; has anyone seen it, or can anyone provide direction? I appreciate your time on this.
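While the root cause is investigated, one stopgap (a sketch, not a fix) is to wrap the collector invocation in coreutils `timeout` so a hung JVM cannot stall the container indefinitely; `MAX_RUNTIME` and the wrapped command line are assumptions to adapt to the real start.sh:

```shell
#!/bin/bash
# Sketch: wrap the crawler command in coreutils `timeout` so a hung
# shutdown cannot stall the container forever. MAX_RUNTIME is an assumed
# budget; pass the real collector-http.sh command line as "$@".
run_crawler() {
    timeout --signal=TERM "${MAX_RUNTIME:-6h}" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "$(date) - crawler killed after ${MAX_RUNTIME:-6h} (likely hung on shutdown)"
    elif [ "$status" -eq 0 ]; then
        echo "$(date) crawl process exited successfully"
    else
        echo "$(date) - Error occurred running crawler (exit $status)."
    fi
    return "$status"
}
```

Exit code 124 then distinguishes a killed (hung) run from a genuine crawler error in the container logs.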

Committer plugin installation causes meta tag indexing issues

We found an issue related to the dependencies the committer plugin installs into the lib folder. It causes a behavior where meta tags are not extracted properly.

The behavior is only reproducible if the body of the page contains a small amount of content; on pages with large content the issue does not occur.

To reproduce the behavior please use the following files:

The html files contain a meta tag for testing

<meta name="test" content="test" />

After some testing we found out that this issue occurs as soon as we install the Cloud Search Norconex HTTP Collector committer plugin.

Steps to reproduce working case

This reproduction step can be used to verify that the meta tag extraction is working properly in the norconex default setup.

  1. Install the Norconex HTTP Collector
     (without the Cloud Search Norconex HTTP Collector committer plugin)
  2. Add the start URLs:
<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
    <url>https://storage.googleapis.com/sascha-issue-reproduction/emptyBody.html</url>
    <url>https://storage.googleapis.com/sascha-issue-reproduction/smallBody.html</url>
    <url>https://storage.googleapis.com/sascha-issue-reproduction/largeBody.html</url>
</startURLs>
  3. As the committer, use the FileSystemCommitter
  4. Start the crawler
  5. Open the .meta files for all 3 pages
  6. Search for the meta tag
test = test
  7. The meta tag can be found in all 3 .meta files
  8. Everything works as expected 👍

Steps to reproduce failure case

This reproduction step can be used to reproduce the error case.

  1. Take the Norconex installation from the previous step and install the Google Cloud Search Norconex committer plugin.
  2. Delete all previously crawled files from the /crawledFiles folder
  3. Start the crawler
  4. Open the .meta files for all 3 pages
  5. Search for the meta tag
test = test
  6. The meta tag can be found only in the file with large content
  7. For the other files, the meta tag is not extracted 👎
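To compare the two runs quickly, the FileSystemCommitter output can be checked with a small script (a sketch; the directory layout and the `test = test` field are taken from the steps above):

```shell
#!/bin/bash
# Sketch: report which FileSystemCommitter .meta files are missing a
# field. "test = test" corresponds to the <meta name="test" content="test" />
# tag in the reproduction pages.
check_meta_files() {
    local dir="$1" field="$2" missing=0
    while IFS= read -r -d '' f; do
        if ! grep -q "^${field}" "$f"; then
            echo "MISSING: $f"
            missing=$((missing + 1))
        fi
    done < <(find "$dir" -name '*.meta' -print0)
    return "$missing"
}

# Example: check_meta_files ./crawledFiles "test = test"
```

In the working case this prints nothing; in the failure case it should list the small-body and empty-body pages.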

Versions

  • Norconex: 2.8.1
  • Cloud Search Norconex HTTP Collector committer plugin: latest version master (1f7585b)

Keep in mind

If you reproduce the behavior with the Norconex example configuration, keep in mind that you need to remove:

<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fields>title,keywords,description,document.reference</fields>
</tagger>

Best regards
Sascha

Committer leads to lock issues with the mvstore

Hi Google Team,

we found a critical issue when running Norconex on a schedule with a short crawl interval of 1 minute.
When Norconex finishes indexing documents, the Norconex process is marked as completed even though the committer plugin is still committing files.

This leads to lock errors when the next scheduled crawl run starts, and the crawler fails.

The file is locked: nio:./obfuscated-output/crawlstore/mvstore/obfuscated/mvstore [1.4.199/7]
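While the shutdown ordering is investigated, a workaround sketch is to serialize scheduled runs with util-linux `flock`, so the next interval is skipped instead of colliding with the mvstore file lock (the LOCKFILE path and the exit code are assumptions):

```shell
#!/bin/bash
# Sketch: serialize scheduled crawl runs so a committer that is still
# flushing its queue cannot collide with the next run's mvstore lock.
# LOCKFILE is an assumed path; pass the real crawler command as "$@".
LOCKFILE="${LOCKFILE:-/tmp/norconex-crawl.lock}"

run_locked() {
    (
        # Non-blocking: if the previous run still holds the lock, skip
        # this interval instead of failing on the mvstore file lock.
        flock -n 9 || { echo "previous crawl still running, skipping"; exit 75; }
        "$@"
    ) 9>"$LOCKFILE"
}
```

This does not fix the premature "completed" state, but it keeps overlapping runs from corrupting or contending on the crawl store.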

The issue is reproducible with the following steps:

Please let me know if you need further information

Best regards
Sascha

"Unable to upload default ACL" shows up randomly

Hi,

I am using GKE to run the Norconex crawler with this plugin.
There are about 6 crawler jobs with a MongoDB datastore. The jobs are shown below.

gcs_user@cloudshell:~/deployment (ai-gcsimpl-uat-235207)$ kubectl get pods
NAME                          READY   STATUS      RESTARTS   AGE
crawler-job-acer-tnsx8        1/1     Running     0          12h
crawler-job-acerpro-q77bk     1/1     Running     0          12h
crawler-job-community-hbzp7   1/1     Running     0          12h
crawler-job-custhelp-jgn4s    0/1     Completed   0          12h
crawler-job-datasheet-hd684   1/1     Running     0          12h
crawler-job-ec-wxdb5          1/1     Running     0          12h
mongo-0                       2/2     Running     0          10h

The problem is that the crawlers randomly show an "Unable to upload default ACL" error; it happens across all crawlers. Below is an error log that showed up recently.
I have double-checked the configuration and it is correct; I can provide it if needed.

ERROR [JobSuite] Execution failed for job: webcrawler-datasheet
com.google.enterprise.cloudsearch.sdk.StartupException: Unable to upload default ACL.
        at com.google.enterprise.cloudsearch.sdk.indexing.DefaultAcl.<init>(DefaultAcl.java:220)
        at com.google.enterprise.cloudsearch.sdk.indexing.DefaultAcl.<init>(DefaultAcl.java:93)
        at com.google.enterprise.cloudsearch.sdk.indexing.DefaultAcl$Builder.build(DefaultAcl.java:456)
        at com.google.enterprise.cloudsearch.sdk.indexing.DefaultAcl.fromConfiguration(DefaultAcl.java:266)
        at com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter$Helper.initDefaultAclFromConfig(GoogleCloudSearchCommitter.java:378)
        at com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter.init(GoogleCloudSearchCommitter.java:170)
        at com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter.commitBatch(GoogleCloudSearchCommitter.java:203)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:274)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:228)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:131)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException
        at com.google.enterprise.cloudsearch.sdk.AsyncRequest$SettableFutureCallback.onFailure(AsyncRequest.java:134)
        at com.google.api.client.googleapis.batch.json.JsonBatchCallback.onFailure(JsonBatchCallback.java:54)
        at com.google.api.client.googleapis.batch.json.JsonBatchCallback.onFailure(JsonBatchCallback.java:50)
        at com.google.api.client.googleapis.batch.BatchUnparsedResponse.parseAndCallback(BatchUnparsedResponse.java:223)
        at com.google.api.client.googleapis.batch.BatchUnparsedResponse.parseNextResponse(BatchUnparsedResponse.java:155)
        at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:253)
        at com.google.enterprise.cloudsearch.sdk.BatchRequestService$BatchRequestHelper.executeBatchRequest(BatchRequestService.java:427)
        at com.google.enterprise.cloudsearch.sdk.BatchRequestService$SnapshotRunnable.execute(BatchRequestService.java:297)
        at com.google.enterprise.cloudsearch.sdk.BatchRequestService$SnapshotRunnable.run(BatchRequestService.java:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
INFO  [JobSuite] Running webcrawler-datasheet: END (Wed Apr 17 14:29:52 UTC 2019)
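The stack trace surfaces only a bare GoogleJsonResponseException without its JSON body, which makes the ACL failure hard to diagnose. One way to capture more detail (a sketch; the logger name is assumed from the package names in the trace) is to raise `java.util.logging` verbosity for the SDK via a logging.properties file passed to the crawler JVM with `-Djava.util.logging.config.file=<path>`:

```properties
# Hypothetical logging.properties: raise SDK verbosity so the body of
# the GoogleJsonResponseException (the actual ACL upload failure) is
# logged alongside the StartupException.
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = FINE
com.google.enterprise.cloudsearch.sdk.level = FINE
```

With the response body logged, you can see whether the failures are quota errors, transient 5xx responses, or a permissions problem on the identity source.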
