norconex / committer-elasticsearch

Implementation of Norconex Committer for Elasticsearch.

Home Page: https://opensource.norconex.com/committers/elasticsearch/

License: Apache License 2.0



committer-elasticsearch's People

Contributors: essiembre, kalhomoud, pascaldimassimo


committer-elasticsearch's Issues

Must I create index in Elasticsearch when using committer?

Hi,

I would like to use the HTTP Collector with the Elasticsearch Committer, but I do not know how to set up Elasticsearch so that it can receive data from the committer.

Here is my config for the HTTP Collector:


<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
         <url>http://www.example.com</url>
      </startURLs>
      <robotsTxt ignore="true" />
      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/minimum</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>20</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="1" />
      
      
      <!-- Decide what to do with your files by specifying a Committer. -->

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>http://192.168.0.61:9200/</nodes>
        <indexName>moj</indexName>
        <typeName>mojtype</typeName>
        <ignoreResponseErrors>false</ignoreResponseErrors>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

Do I have to create an Elasticsearch index named "moj" myself?

How do I create the typeName "mojtype"?

Are the index and type created automatically when documents are committed to Elasticsearch?

It is also not clear which fields the Elasticsearch Committer sends to Elasticsearch, or what the resulting index mapping looks like.
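Not part of the committer itself, but one quick way to answer the mapping question: with default Elasticsearch settings (automatic index creation and dynamic mapping enabled), the index "moj" and type "mojtype" are created automatically on the first bulk commit, and the generated mapping can be inspected afterwards. A minimal sketch, assuming the same low-level Elasticsearch RestClient the committer relies on and the node address from the config above:

import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ShowMapping {
    public static void main(String[] args) throws Exception {
        // Node address taken from the <nodes> setting in the config above.
        try (RestClient client = RestClient.builder(
                new HttpHost("192.168.0.61", 9200, "http")).build()) {
            // Prints whatever mapping Elasticsearch generated dynamically for "moj".
            Response response = client.performRequest(
                    "GET", "/moj/_mapping", Collections.<String, String>emptyMap());
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}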

Thanks.

Field types as integer instead of text

I would like to be able to use collector.lastmodified and collector.filesize as sort fields. However, since they are mapped as "text", sorting does not work correctly. I tried mapping the fields as integer but got this error on commit:

{
    "_index": "wmsearch",
    "_type": "doc",
    "_id": "file:///c:/xxx",
    "status": 400,
    "error": {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [collector.lastmodified]",
        "caused_by": {
            "type": "number_format_exception",
            "reason": "For input string: \"1495129910000\""
        }
    }
}

Is it possible to change these field values from text to integer?

I can think of a possible workaround, such as creating a new field (called something else) with an "integer" type, and using ScriptTagger or similar to copy the value to the new field so it can be used as a sort field in ES. However, I am hoping this could be updated in a snapshot version. Should I perform the workaround, or do you think this could be changed in the ES committer?
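As a hedged sketch of the mapping side of such a workaround (an assumption, not a committer feature): the sample value 1495129910000 does not fit in Elasticsearch's 32-bit integer type, which is why the number_format_exception above appears; mapping the fields as date (epoch_millis) and long would make them sortable. The index would be created with an explicit mapping before the first commit; names below match the error output:

import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

public class CreateSortableIndex {
    public static void main(String[] args) throws Exception {
        // "collector.lastmodified" and "collector.filesize" arrive with dots in their
        // names; Elasticsearch stores them as sub-fields of a "collector" object.
        String body = "{"
                + "\"mappings\": { \"doc\": { \"properties\": { \"collector\": { \"properties\": {"
                + "  \"lastmodified\": { \"type\": \"date\", \"format\": \"epoch_millis\" },"
                + "  \"filesize\":     { \"type\": \"long\" }"
                + "} } } } }"
                + "}";
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            client.performRequest("PUT", "/wmsearch",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(body, ContentType.APPLICATION_JSON));
        }
    }
}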

Thanks

id is too long, must be no longer than 512 bytes but was: 520

Running into an error committing to Elasticsearch. I assume this is the "_id" shown in Kibana, which appears to be the URL of the page and the same as "document.reference".

[Screenshot from Kibana, 2017-11-17]

INFO  [ElasticsearchCommitter] Sending 100 commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Elasticsearch RestClient closed.
INFO  [AbstractCrawler] Anonymous Coward: Crawler executed in 12 seconds.
INFO  [SitemapStore] Anonymous Coward: Closing sitemap store...
ERROR [JobSuite] Execution failed for job: Anonymous Coward
com.norconex.committer.core.CommitterException: Could not commit JSON batch to Elasticsearch.
	at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:449)
	at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
	at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
	at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
	at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
	at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:387)
	at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:270)
	at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:226)
	at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
	at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
	at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
	at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
	at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
	at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
	at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
	at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: org.elasticsearch.client.ResponseException: POST http://10.80.99.54:9200/_bulk: HTTP/1.1 400 Bad Request
{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: id is too long, must be no longer than 512 bytes but was: 520;2: id is too long, must be no longer than 512 bytes but was: 520;3: id is too long, must be no longer than 512 bytes but was: 520;4: id is too long, must be no longer than 512 bytes but was: 520;5: id is too long, must be no longer than 512 bytes but was: 520;6: id is too long, must be no longer than 512 bytes but was: 520;7: id is too long, must be no longer than 512 bytes but was: 520;8: id is too long, must be no longer than 512 bytes but was: 520;9: id is too long, must be no longer than 512 bytes but was: 520;10: id is too long, must be no longer than 512 bytes but was: 558;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: id is too long, must be no longer than 512 bytes but was: 520;2: id is too long, must be no longer than 512 bytes but was: 520;3: id is too long, must be no longer than 512 bytes but was: 520;4: id is too long, must be no longer than 512 bytes but was: 520;5: id is too long, must be no longer than 512 bytes but was: 520;6: id is too long, must be no longer than 512 bytes but was: 520;7: id is too long, must be no longer than 512 bytes but was: 520;8: id is too long, must be no longer than 512 bytes but was: 520;9: id is too long, must be no longer than 512 bytes but was: 520;10: id is too long, must be no longer than 512 bytes but was: 558;"},"status":400}
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:354)
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:343)
	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
	at java.lang.Thread.run(Thread.java:748)
INFO  [JobSuite] Running Anonymous Coward: END (Fri Nov 17 14:21:41 PST 2017)
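The limit comes from Elasticsearch itself: _id may be at most 512 bytes, and the document reference (the URL) is used as the id. A hedged workaround sketch, not a committer option: derive a shorter but still unique reference for over-long URLs, for example from a custom pre-commit step or a tagger writing a hypothetical field that is then used as the committer's sourceReferenceField:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public final class ShortIds {

    private ShortIds() {
    }

    /** Returns the reference unchanged if it fits, otherwise a truncated
     *  prefix plus a SHA-256 digest that stays well under 512 bytes. */
    public static String shorten(String reference) throws Exception {
        byte[] utf8 = reference.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= 512) {
            return reference;
        }
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(utf8);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        // 400 chars of (mostly ASCII) URL + '#' + 64 hex chars stays under 512 bytes.
        return reference.substring(0, 400) + "#" + hex;
    }
}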

Reversed logic in the JSON field name matcher

The method below

    private void appendValue(StringBuilder json, String field, String value) {
        if (getJsonFieldsPattern() != null 
                && getJsonFieldsPattern().matches(field)) {
            json.append(value);
        } else {
            json.append('"')
                .append(StringEscapeUtils.escapeJson(value))
                .append("\"");
        }
    }

should have the JSON pattern match reversed:

    private void appendValue(StringBuilder json, String field, String value) {
        if (getJsonFieldsPattern() != null 
                && field.matches(getJsonFieldsPattern())) {
            json.append(value);
        } else {
            json.append('"')
                .append(StringEscapeUtils.escapeJson(value))
                .append("\"");
        }
    }
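For reference, the reason the order matters is that String.matches(regex) treats its argument as the regular expression, not the input. A small illustration with a hypothetical configured pattern:

public class MatchesOrderDemo {
    public static void main(String[] args) {
        String jsonFieldsPattern = "collector\\..*"; // hypothetical configured pattern
        String field = "collector.lastmodified";

        // Buggy order: the pattern string is treated as the input,
        // and the field name is treated as the regex.
        System.out.println(jsonFieldsPattern.matches(field)); // false

        // Fixed order: the field name is tested against the configured pattern.
        System.out.println(field.matches(jsonFieldsPattern)); // true
    }
}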

No error for connecting to elasticsearch even though elasticsearch service is stopped?

Hi,
I've been messing around with the Elasticsearch committer for a week or so, and for the life of me, I can't get the collector to commit to Elasticsearch. There is no error when I run the collector, which makes it very difficult to troubleshoot... I have a feeling that I'm just missing something small, but it could possibly be a bug.

Below is the xml config:

<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile).
           Optionally limit crawling to same protocol/domain/port as
           start URLs. -->
      <startURLs stayOnDomain="false" stayOnPort="true" stayOnProtocol="true">
        <url>http://s3.amazonaws.com/brooks-institute-test-bucket/AGENDA-_Biodiversity_Protection-_Implementation_and_Reform_of_the12.pdf</url>
        <!--<urlsFile>/home/ec2-user/finalWebsiteList.txt</urlsFile> -->
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/testBucket</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>-1</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="false" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- Document importing -->

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <indexName>brooks-test-app1</indexName>
        <typeName>test</typeName>
        <nodes>http://localhost:9200</nodes>
        <ignoreResponseErrors>false</ignoreResponseErrors>
        <queueSize>10</queueSize>
        <commitBatchSize>10</commitBatchSize>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

and here are the logs when I run it:

INFO  [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress
INFO  [JobSuite] JEF work directory is: ./examples-output/minimum/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.7.1 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.8.2 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.7.2 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.1 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Elasticsearch 4.0.0 (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Mon Nov 20 16:43:33 UTC 2017)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: <None specified>
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
INFO  [StandardSitemapResolver] Resolving sitemap: http://s3.amazonaws.com/sitemap.xml
INFO  [StandardSitemapResolver]          Resolved: http://s3.amazonaws.com/sitemap.xml
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://s3.amazonaws.com/brooks-institute-test-bucket/AGENDA-_Biodiversity_Protection-_Implementation_and_Reform_of_the12.pdf
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://s3.amazonaws.com/brooks-institute-test-bucket/AGENDA-_Biodiversity_Protection-_Implementation_and_Reform_of_the12.pdf
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://s3.amazonaws.com/brooks-institute-test-bucket/AGENDA-_Biodiversity_Protection-_Implementation_and_Reform_of_the12.pdf
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://s3.amazonaws.com/brooks-institute-test-bucket/AGENDA-_Biodiversity_Protection-_Implementation_and_Reform_of_the12.pdf
INFO  [AbstractCrawler] Norconex Minimum Test Page: Reprocessing any cached/orphan references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://s3.amazonaws.com/brooks-institute-test-bucket/Coyote_pet-graphic.pdf
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://s3.amazonaws.com/brooks-institute-test-bucket/Coyote_pet-graphic.pdf
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://s3.amazonaws.com/brooks-institute-test-bucket/Coyote_pet-graphic.pdf
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://s3.amazonaws.com/brooks-institute-test-bucket/Coyote_pet-graphic.pdf
INFO  [AbstractCrawler] Norconex Minimum Test Page: 50% completed (2 processed/4 total)
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://s3.amazonaws.com/brooks-institute-test-bucket/coyote-killing-infographic.pdf
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://s3.amazonaws.com/brooks-institute-test-bucket/coyote-killing-infographic.pdf
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://s3.amazonaws.com/brooks-institute-test-bucket/coyote-killing-infographic.pdf
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://s3.amazonaws.com/brooks-institute-test-bucket/coyote-killing-infographic.pdf
INFO  [AbstractCrawler] Norconex Minimum Test Page: 75% completed (3 processed/4 total)
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://s3.amazonaws.com/brooks-institute-test-bucket/jouranimallawvol4_p59.pdf
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://s3.amazonaws.com/brooks-institute-test-bucket/jouranimallawvol4_p59.pdf
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://s3.amazonaws.com/brooks-institute-test-bucket/jouranimallawvol4_p59.pdf
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://s3.amazonaws.com/brooks-institute-test-bucket/jouranimallawvol4_p59.pdf
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://s3.amazonaws.com/brooks-institute-test-bucket/naturecona92queensland.pdf
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://s3.amazonaws.com/brooks-institute-test-bucket/naturecona92queensland.pdf
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://s3.amazonaws.com/brooks-institute-test-bucket/naturecona92queensland.pdf
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://s3.amazonaws.com/brooks-institute-test-bucket/naturecona92queensland.pdf
INFO  [AbstractCrawler] Norconex Minimum Test Page: 100% completed (5 processed/5 total)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [ElasticsearchCommitter] Elasticsearch RestClient closed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 5 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 20 seconds.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Mon Nov 20 16:43:33 UTC 2017)

Error: java.lang.NoSuchMethodError: org.json.JSONArray.iterator()Ljava/util/Iterator;

Hello, when running the crawler with multiple threads, I get the following error:

INFO  [AbstractCrawler] WM Search: 100% completed (23343 processed/23343 total)
INFO  [AbstractCrawler] WM Search: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] WM Search: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 124 files
INFO  [ElasticsearchCommitter] Sending 100 commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Done sending commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Sending 24 commit operations to Elasticsearch.
INFO  [AbstractCrawler] WM Search: Crawler executed in 45 minutes 15 seconds.
FATAL [JobSuite] Fatal error occured in job: WM Search
INFO  [JobSuite] Running WM Search: END (Thu Sep 21 23:08:17 EDT 2017)
FATAL [JobSuite] Job suite execution failed: WM Search
java.lang.NoSuchMethodError: org.json.JSONArray.iterator()Ljava/util/Iterator;
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.extractResponseErrors(ElasticsearchCommitter.java:493)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.handleResponse(ElasticsearchCommitter.java:469)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:442)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:387)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:273)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:227)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:183)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.fs.FilesystemCollector.main(FilesystemCollector.java:76)

When I run with 1 thread it completes successfully:

INFO  [AbstractCrawler] WM Search: 100% completed (23343 processed/23343 total)
INFO  [AbstractCrawler] WM Search: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] WM Search: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 124 files
INFO  [ElasticsearchCommitter] Sending 100 commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Done sending commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Sending 24 commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Done sending commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Elasticsearch RestClient closed.
INFO  [AbstractCrawler] WM Search: 23343 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] WM Search: Crawler completed.
INFO  [AbstractCrawler] WM Search: Crawler executed in 1 hour 4 minutes 34 seconds.
INFO  [JobSuite] Running WM Search: END (Fri Sep 22 12:15:54 EDT 2017)

In both cases, I started clean by removing the index in ES, removing the committer-queue, and workdir files (just to be sure nothing was left over from previous runs). Here is my environment:

[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex Filesystem Collector 2.7.2-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex Collector Core 1.9.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex Importer 2.8.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex JEF 4.1.0 (Norconex Inc.)
[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex Committer Core 2.1.2-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex Committer Elasticsearch 4.0.0 (Norconex Inc.)

and my config file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!-- 
   Copyright 2010-2017 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->

<fscollector id="Text Files">

  <logsDir>${workdir}/logs</logsDir>
  <progressDir>${workdir}/progress</progressDir>


  <crawlers>
    <crawler id="WM Search">

      <workDir>${workdir}</workDir>

      <startPaths>
        <path>c:\Clients</path>
      </startPaths>

      <numThreads>1</numThreads>

      <keepDownloads>false</keepDownloads>

      <documentFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
          (.*\/~.+|.*umbs\.db|.*\.shs|.*\.lnk|.*/\%23.+)
        </filter> 
      </documentFilters>

      <importer>
        <parseErrorsSaveDir>${workdir}/errors</parseErrorsSaveDir>
        <postParseHandlers>

          <tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
            <script><![CDATA[
              metadata.setString('document.filename', 
              metadata.getString('document.reference').replace(/\.[^/.]+$/, "").replace(/^.*[\\\/]/,""));
            ]]></script>
          </tagger>

          <tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
              field="crawl_date" format="yyyy-MM-dd HH:mm" />

          <tagger class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
              overwrite="true" 
              titleMaxLength="60"
              detectHeading="true"
              detectHeadingMinLength="15"
              detectHeadingMaxLength="60"
              sourceCharset="(character encoding)">
          </tagger>
        </postParseHandlers>
      </importer>

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>http://localhost:9200</nodes>
        <indexName>wmsearch</indexName>
        <typeName>doc</typeName>
        <commitBatchSize>100</commitBatchSize>
      </committer>

    </crawler>
  </crawlers>

</fscollector>

I am not sure what is causing this issue. Please advise.
Thanks in advance

java.lang.NoSuchFieldError: LUCENE_3_6

I am having trouble with the Elasticsearch committer. The crawler works fine, but when it tries to send to Elasticsearch it gets a "java.lang.NoSuchFieldError: LUCENE_3_6". I've tried looking around for the source of this error but ran out of ideas.

Here is the exception output from the crawler:

INFO  - AbstractCrawler            - Crawler #1: Crawling references...
INFO  - AbstractCrawler            - Crawler #1: Deleting orphan references (if any)...
INFO  - AbstractCrawler            - Crawler #1: Deleted 0 orphan URLs...
INFO  - AbstractCrawler            - Crawler #1: Crawler finishing: committing documents.
INFO  - AbstractFileQueueCommitter - Committing 2 files
INFO  - ElasticsearchCommitter     - Sending 2 operations to Elasticsearch.
INFO  - AbstractCrawler            - Crawler #1: Crawler executed in 0 second.
INFO  - MapDBCrawlDataStore        - Closing reference store: ./crawl-output/myapp/crawlstore/mapdb/Crawler_32__35_1/
FATAL - JobSuite                   - Fatal error occured in job: Crawler #1
INFO  - JobSuite                   - Running Crawler #1: END (Mon Jun 01 17:00:57 EDT 2015)
FATAL - JobSuite                   - Job suite execution failed: Crawler #1
java.lang.NoSuchFieldError: LUCENE_3_6
    at org.elasticsearch.Version.<clinit>(Version.java:43)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:138)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:166)
    at com.norconex.committer.elasticsearch.DefaultClientFactory.createClient(DefaultClientFactory.java:44)
    at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:182)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:178)
    at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:158)
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:232)
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:245)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:206)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:169)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:352)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:302)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:172)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:116)

Any ideas?

Thanks

Limit of total fields [1000] in index has been exceeded

The ES committer continues to run (it doesn't exit on error), but the logs show the failure. Here is a snippet:

WM Search: 2017-09-24 18:23:23 INFO - Sending 100 commit operations to Elasticsearch.
WM Search: 2017-09-24 18:23:25 INFO - Elasticsearch RestClient closed.
WM Search: 2017-09-24 18:23:25 INFO - Elasticsearch RestClient closed.
WM Search: 2017-09-24 18:23:25 INFO -            REJECTED_ERROR: file:///c:/xxx.doc (com.norconex.committer.core.CommitterException: Elasticsearch returned one or more errors:
[{
    "_index": "wmsearch",
    "_type": "doc",
    "_id": "file:///c:xxxx",
    "status": 400,
    "error": {
        "type": "illegal_argument_exception",
        "reason": "Limit of total fields [1000] in index [wmsearch] has been exceeded"
    }
},
...
}]
	at com.norconex.committer.elasticsearch.ElasticsearchCommitter.handleResponse(ElasticsearchCommitter.java:514)
	at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:482)
	at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
	at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
	at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
	at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
	at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:427)
	at com.norconex.committer.core.AbstractCommitter.commitIfReady(AbstractCommitter.java:146)
	at com.norconex.committer.core.AbstractCommitter.add(AbstractCommitter.java:97)
	at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:34)
	at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:27)
	at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
	at com.norconex.collector.fs.crawler.FilesystemCrawler.executeCommitterPipeline(FilesystemCrawler.java:243)
	at com.norconex.collector.core.crawler.AbstractCrawler.processImportResponse(AbstractCrawler.java:586)
	at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:543)
	at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:418)
	at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:803)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
WM Search: 2017-09-24 18:23:25 INFO - DOCUMENT_METADATA_FETCHED: file:///c:...

I updated the field limit to 2000:

"index.mapping.total_fields.limit": 2000

This resolved the issue, but I suggest that the committer exit on failure (or something to that effect).
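For reference, a hedged sketch of that settings change using the same low-level RestClient the committer relies on; index.mapping.total_fields.limit is an index-level dynamic setting, so it should be updatable on the live index without reindexing (index name as in the logs above):

import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

public class RaiseFieldLimit {
    public static void main(String[] args) throws Exception {
        String body = "{ \"index.mapping.total_fields.limit\": 2000 }";
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // Updates the existing index in place via the _settings endpoint.
            client.performRequest("PUT", "/wmsearch/_settings",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(body, ContentType.APPLICATION_JSON));
        }
    }
}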

Not able to commit the processed items to Elasticsearch using the Norconex Filesystem Collector

Hi,
I am using the Norconex Filesystem Collector to crawl files from a shared path. I am trying to commit the processed items to Elasticsearch and to a file committer. Nothing is committed to Elasticsearch/Solr, but the documents do get saved to the file system.
Please find the config file below, and please help me resolve the issue.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!-- 
   Copyright 2010-2017 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->

<fscollector id="Text Files">

## Either uncomment or set the following variables or create yourself a 
## sample-config.variables (or properties) with the same variables set.

#set($path = "valid path")
#set($workdir = "E:\filesystem\norconex-collector-filesystem-2.8.0\norconex-collector-filesystem-2.8.0\examples")

#set($tagger = "com.norconex.importer.handler.tagger.impl")
#set($transformer = "com.norconex.importer.handler.transformer.impl")

  <logsDir>${workdir}/logs</logsDir>
  <progressDir>${workdir}/progress</progressDir>


  <crawlers>
    <crawler id="Sample Crawler">

      <workDir>${workdir}</workDir>

      <startPaths>
        <path>${path}</path>
      </startPaths>
      
      <numThreads>2</numThreads>

      <keepDownloads>false</keepDownloads>

      <importer>
        <postParseHandlers>
          <tagger class="${tagger}.ReplaceTagger">
            <replace fromField="samplefield" regex="true">
              <fromValue>ping</fromValue><toValue>pong</toValue>
            </replace>
            <replace fromField="Subject" regex="true">
				<fromValue>Sample to crawl</fromValue><toValue>Sample crawled</toValue>
			</replace>            
          </tagger>
        </postParseHandlers>
      </importer>
       <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    	<nodes>http://localhost:9200</nodes>
    	<indexName>filetest</indexName>
    	<typeName>filetest1</typeName>
      </committer>
	     <committer class="com.norconex.committer.core.impl.JSONFileCommitter">
      <directory>${workdir}/jsoncrawledFiles</directory>
      <pretty>true</pretty>
      <!-- <docsPerFile>(max number of docs per JSON file)</docsPerFile> -->
      <!-- <compress>[false|true]</compress> -->
      <splitAddDelete>true</splitAddDelete>
      <fileNamePrefix>test</fileNamePrefix>
      <fileNameSuffix>json</fileNameSuffix>
  </committer>
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>${workdir}/crawledFiles</directory>
      </committer>
	
    </crawler>
  </crawlers>

</fscollector>

Feature request - provide refresh and wait for active shards as configurables

This is a simple one to implement, but low priority. Some customers may want to manipulate the refresh and wait_for_active_shards parameters when committing data to Elasticsearch.

That would mean providing something other than Collections.emptyMap() here.

This is related to my Elasticsearch commit hang that caused my stop/resume strategy to fail, but it isn't the cause of it. I just wanted to write down the need to configure these parameters for some customers.
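As a sketch only (not current committer behavior), the change would essentially amount to passing request parameters instead of the empty map on the _bulk call, for example:

import java.util.HashMap;
import java.util.Map;

import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class BulkWithParams {
    // The values below are illustrations; both would presumably become committer settings.
    static Response sendBulk(RestClient client, String bulkJson) throws Exception {
        Map<String, String> params = new HashMap<>();
        params.put("refresh", "true");               // make documents searchable immediately
        params.put("wait_for_active_shards", "2");   // require two active shard copies per write
        return client.performRequest("POST", "/_bulk", params,
                new NStringEntity(bulkJson, ContentType.APPLICATION_JSON));
    }
}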

Deletion not working when a non "document.reference" value is used

Hello Pascal,

It looks like there is an issue with deletion: docs don't get deleted from ES when the sourceReferenceField parameter is used:

<sourceReferenceField keep="false">id</sourceReferenceField>

BTW, when "id" is equal to "document.reference", then deletion works fine.

Version:

INFO  [AbstractCollector] Version: Norconex Committer Elasticsearch 4.1.0 (Norconex Inc.)

Thanks!

Norconex Collector content is not committed to Elasticsearch.

I have configured the committer and there is no error in the console, but my content is not committed to Elasticsearch.

  <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <indexName>test</indexName>
    <typeName>intranet</typeName>
    <nodes>http://localhost:9200/</nodes>
    <targetContentField>content</targetContentField>
    <maxRetries>1</maxRetries>
    <maxRetryWait>1</maxRetryWait>
    <queueSize>10</queueSize>
    <commitBatchSize>10</commitBatchSize>
    <queueDir>${workdir}/elastic-commiter</queueDir>
  </committer>

I had not been using the proper tags for the committer.

Error when indexing to Elasticsearch through http collector

Command:
collector-http.bat -a start -c examples/kbenp-web-elastic/kbenp-web-elastic-config.xml
kbenp-web-elastic-config.xml.txt

Error:

FATAL [JobSuite] Job suite execution failed: KBenP web elastic
java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
        at org.elasticsearch.threadpool.ThreadPool.<clinit>(ThreadPool.java:190)
        at org.elasticsearch.client.transport.TransportClient$Builder.build(TransportClient.java:131)
        at com.norconex.committer.elasticsearch.DefaultClientFactory.buildTransportClient(DefaultClientFactory.java:88)
        at com.norconex.committer.elasticsearch.DefaultClientFactory.createClient(DefaultClientFactory.java:54)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:241)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:255)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:216)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:179)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:350)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:300)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:172)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:120)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:80)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

Issues after revert to elasticsearch 2.4

Having naively made my crawl work with Elasticsearch 5.3.1 and norconex-committer-elasticsearch 3.0.0, I am told that our server is running Elasticsearch 2.4.x.

I have attempted to back down to norconex-committer-elasticsearch 2.1.0, and to override the Elasticsearch version to 2.4.5 to get the latest in that vein. However, I'm still using norconex-collector-http version 2.7.0.

At the end of the artificially short crawl, I see the exception below.

[2]: index [medlineplus], type [page], id [6c56fd8b-872a-43c7-af54-c60c90f184fd], message 
[RemoteTransportException[[Rebel][127.0.0.1:9300][indices:data/write/bulk[s]]]; nested: 
RemoteTransportException[[Rebel][127.0.0.1:9300][indices:data/write/bulk[s][p]]]; nested: 
UnavailableShardsException[[medlineplus][1] Not enough active copies to meet write consistency of 
[QUORUM] (have 1, needed 2). Timeout: [1m], request: [BulkShardRequest to [medlineplus] containing [3] requests]];]
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.sendBulkToES(ElasticsearchCommitter.java:335)
...

I appreciate any help you can provide in addressing this issue.

Facebook crawling with elastic

I think the ElasticsearchCommitter fits my needs well, but I got an error after setting it up and running it.

INFO  [AbstractFileQueueCommitter] Committing 55 files
INFO  [ElasticsearchCommitter] Sending 55 operations to Elasticsearch.
INFO  [Node] [Wyatt Wingfoot] version[2.3.5], pid[49444], build[90f439f/2016-07-27T10:36:52Z]
INFO  [Node] [Wyatt Wingfoot] initializing ...
INFO  [PluginsService] [Wyatt Wingfoot] modules [], plugins [], sites []
INFO  [AbstractCrawler] Facebook Posts: Crawler executed in 1 second.
FATAL [JobSuite] Fatal error occured in job: Facebook Posts
INFO  [JobSuite] Running Facebook Posts: END (Thu Apr 06 15:14:07 CEST 2017)
FATAL [JobSuite] Job suite execution failed: Facebook Posts
java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
	at org.elasticsearch.threadpool.ThreadPool.<clinit>(ThreadPool.java:190)
	at org.elasticsearch.node.Node.<init>(Node.java:170)
	at org.elasticsearch.node.Node.<init>(Node.java:140)
	at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
	at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:150)

I already installed Elasticsearch and I'm able to put data into it, so it works. I installed committer 3.0 because I found your comment somewhere saying that it is the right version for Elasticsearch 5.x.
I put the jars into the lib folder as explained, and set up the XML this way:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <indexname>events</indexname>
  <typename>face-crawled</typename>
  <clustername>data-mining</clustername>
  <clusterhosts>localhost</clusterhosts>
</committer>

which matches the settings of my Elasticsearch cluster, node, and index.

I think somehow it doesn't find the right function.
Can you help me out?
Thanks

Committer for AWS Elastic Search

Hi,

I am using the committer with my local Elasticsearch instance and it is working perfectly fine.
I am trying to commit to AWS Elasticsearch, so where should I provide the AWS key and password to connect to AWS?

Is there any example or documentation for using AWS Elasticsearch instead of a local instance?

Thanks

utf-8 unicode

Hi everybody. I am new to Norconex. I am trying to write a crawler using this config, but I get Unicode-escaped characters instead of UTF-8 (Persian) text in the results.

fields into Elasticsearch

Testing your products; so far the crawler works and was much easier to follow than Nutch or StormCrawler.

I am sending to Elasticsearch 5.6 and the data is there. However, I am a bit confused about the fields, as I am not seeing what I had expected. I would like to see ALL available data fields sent to ES, and then I can pare back as needed.

In your example:

    <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
           <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>

If I don't set it, even less data shows up in ES. How can I send all the data fields the crawler finds? (Or: how do I know what data fields are available so I can set this value?)

I am also confused about sourceContentField and targetContentField, i.e. what are the source content fields to choose from? I really want to end up with a field in ES that has all the text from the page or document.

Thanks!

Committer closed without sending any documents

I have configured an Elasticsearch domain in AWS and verified it works by PUTting a document into it using curl.

However, when running the http-collector configured with the elasticsearch-committer, the committer just closes without sending any documents or reporting any errors:

INFO  [AbstractCrawler] MyWebsite: Crawler finishing: committing documents.
INFO  [ElasticsearchCommitter] Elasticsearch RestClient closed.
INFO  [AbstractCrawler] MyWebsite: 4 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] MyWebsite: Crawler completed.
INFO  [AbstractCrawler] MyWebsite: Crawler executed in 12 seconds.
INFO  [SitemapStore] MyWebsite: Closing sitemap store...
INFO  [JobSuite] Running MyWebsite: END (Mon Feb 10 08:58:07 UTC 2020)

This line (INFO [ElasticsearchCommitter] Elasticsearch RestClient closed.) is the only output I get from the committer, which is configured as follows:

        <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
            <nodes>https://hostname-in-aws</nodes>
            <indexName>mywebsite</indexName>
            <queueSize>1</queueSize>
            <commitBatchSize>1</commitBatchSize>
            <ignoreResponseErrors>false</ignoreResponseErrors>
        </committer>

How can I increase the log level? Or, what could be the problem here?

Feature Request: Nested Fields

Hello,

I would like to be able to create a nested field in my Elasticsearch-ingested documents. (Reference)

Rather than doing this programmatically, one potential idea would be to allow the user to modify the default mapping using a JSON configuration file in a text editor (this is the way it was done in another crawler). Essentially, the default mapping was provided in a JSON file; I was then able to edit it so the field would be included when creating the index. The committer would just need to validate its integrity, or simply send it as-is to ES.

Thank you for the hard work on this project.
John

Index name determined by parameters

I've just returned from elasticsearch Core Developers training, and I've learned that the most common idiom is to control your indices settings and mappings using "templates", and control what you are searching using "aliases". Tools like logstash support an index name that depends on parameters, e.g. the index name actually depends on properties of the data to be indexed.

Issue Norconex/crawlers#359 would address this, but a more thorough solution would be to allow date/timestamp substitution in the indexName, or to have a 1-up counter used on some periodic time basis.

These are ideas - both this issue and the former issue are syntactic sugar, as someone can always use the Elasticsearch "alias" feature for indexing as well as searching. That is, it is possible to use aliases for indexing as well, so that an alias such as "allcrawls" is used for searching, and an alias such as "current-crawl" is used for indexing.
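For illustration, that alias idiom can be set up outside the committer with a single _aliases call (the dated index name below is hypothetical; the alias names are the ones mentioned above). The committer's indexName would then point at the write alias, which works as long as the alias resolves to a single index:

import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

public class SetupCrawlAliases {
    public static void main(String[] args) throws Exception {
        // "crawl-2018-01" is a hypothetical dated index; "current-crawl" is the
        // write alias and "allcrawls" the search alias.
        String body = "{ \"actions\": ["
                + "{ \"add\": { \"index\": \"crawl-2018-01\", \"alias\": \"current-crawl\" } },"
                + "{ \"add\": { \"index\": \"crawl-2018-01\", \"alias\": \"allcrawls\" } }"
                + "] }";
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            client.performRequest("POST", "/_aliases",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(body, ContentType.APPLICATION_JSON));
        }
    }
}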

WARN - unknown role

There are many warnings at the end of every crawl, e.g.

2021-04-08 12:19:03 INFO - Done sending commit operations to Elasticsearch.
2021-04-08 12:19:03 WARN - unknown role [data_cold] on node [3kCOVz6aSLCMVNM6Vbfl3A]
2021-04-08 12:19:03 WARN - unknown role [data_content] on node [3kCOVz6aSLCMVNM6Vbfl3A]
2021-04-08 12:19:03 WARN - unknown role [data_hot] on node [3kCOVz6aSLCMVNM6Vbfl3A]
2021-04-08 12:19:03 WARN - unknown role [data_warm] on node [3kCOVz6aSLCMVNM6Vbfl3A]
2021-04-08 12:19:03 WARN - unknown role [ml] on node [3kCOVz6aSLCMVNM6Vbfl3A]
2021-04-08 12:19:03 WARN - unknown role [remote_cluster_client] on node [3kCOVz6aSLCMVNM6Vbfl3A]
2021-04-08 12:19:03 WARN - unknown role [transform] on node [3kCOVz6aSLCMVNM6Vbfl3A]
2021-04-08 12:19:03 WARN - unknown role [data_cold] on node [xppp3e6FQ-61H1dgWu4gdA]
2021-04-08 12:19:03 WARN - unknown role [data_content] on node [xppp3e6FQ-61H1dgWu4gdA]
2021-04-08 12:19:03 WARN - unknown role [data_hot] on node [xppp3e6FQ-61H1dgWu4gdA]
2021-04-08 12:19:03 WARN - unknown role [data_warm] on node [xppp3e6FQ-61H1dgWu4gdA]
2021-04-08 12:19:03 WARN - unknown role [ml] on node [xppp3e6FQ-61H1dgWu4gdA]
2021-04-08 12:19:03 WARN - unknown role [remote_cluster_client] on node [xppp3e6FQ-61H1dgWu4gdA]
2021-04-08 12:19:03 WARN - unknown role [transform] on node [xppp3e6FQ-61H1dgWu4gdA]
2021-04-08 12:19:03 WARN - unknown role [data_cold] on node [fHledkKjQu2x-wuqfNNykg]
2021-04-08 12:19:03 WARN - unknown role [data_content] on node [fHledkKjQu2x-wuqfNNykg]
2021-04-08 12:19:03 WARN - unknown role [data_hot] on node [fHledkKjQu2x-wuqfNNykg]
2021-04-08 12:19:03 WARN - unknown role [data_warm] on node [fHledkKjQu2x-wuqfNNykg]
2021-04-08 12:19:03 WARN - unknown role [ml] on node [fHledkKjQu2x-wuqfNNykg]
2021-04-08 12:19:03 WARN - unknown role [remote_cluster_client] on node [fHledkKjQu2x-wuqfNNykg]
2021-04-08 12:19:03 WARN - unknown role [transform] on node [fHledkKjQu2x-wuqfNNykg]
2021-04-08 12:19:03 INFO - Elasticsearch RestClient closed.

I think it's a known issue: elastic/elasticsearch#52864

Could you please update the ES client, by any chance?
Thank you!

Caused by: java.lang.IllegalArgumentException: Negative buffer size

Hi,

I am getting these new error messages when committing to Elasticsearch. I am running a test on a small directory of ~200 files and none are being committed to ES. This is the error message from the log file:

com.norconex.committer.core.CommitterException: Could not commit JSON batch to Elasticsearch.
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:599)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:537)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:274)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:228)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.fs.FilesystemCollector.main(FilesystemCollector.java:76)
Caused by: java.lang.IllegalArgumentException: Negative buffer size
        at java.io.StringWriter.<init>(StringWriter.java:67)
        at org.apache.commons.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:67)
        at org.apache.commons.text.StringEscapeUtils.escapeJson(StringEscapeUtils.java:585)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.appendValue(ElasticsearchCommitter.java:727)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.append(ElasticsearchCommitter.java:718)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.append(ElasticsearchCommitter.java:697)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.appendAddOperation(ElasticsearchCommitter.java:680)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:577)
        ... 15 more

Thank you for your help,
John

add analyzer to specific field

Using version 3.0.0-SNAPSHOT.
When executing a command like GET /index/type/_mapping/field/content, I see this:

{
  "index": {
    "mappings": {
      "type": {
        "content": {
          "full_name": "content",
          "mapping": {
            "content": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      }
    }
  }
}

Is it possible to add a specific analyzer for specific fields,
like described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer.html?
Most of my content is in Russian, and I want to perform searching on the content field using Russian morphology and stop words.
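A minimal sketch of one way to do it, assuming Elasticsearch's built-in "russian" analyzer and the index/type/field names from the mapping output above: create the index with an explicit mapping for content before the first crawl, and the committer then indexes into it as usual.

import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

public class CreateRussianContentIndex {
    public static void main(String[] args) throws Exception {
        // Only "content" is mapped explicitly; other fields keep dynamic mapping.
        String body = "{ \"mappings\": { \"type\": { \"properties\": {"
                + "\"content\": { \"type\": \"text\", \"analyzer\": \"russian\" }"
                + "} } } }";
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            client.performRequest("PUT", "/index",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(body, ContentType.APPLICATION_JSON));
        }
    }
}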

HTTP collector never exits when committing to elasticsearch

When committing to Elasticsearch (see the config below), the collector-http.sh script never terminates even though the crawler run has already ended. I have to manually kill the process using CTRL+C or kill.

This is using the norconex-collector-http-2.4.0-20151209.033143-7 snapshot and norconex-committer-elasticsearch-2.0.1 against Elasticsearch 1.7.3. I understand that there might be an incompatibility between the committer and ES 1.7, but the documents are committed to ES just fine. Please close this issue if it relates to #2 after all.

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <maxDocuments>1</maxDocuments>

      <committer
        class="com.norconex.committer.elasticsearch.ElasticsearchCommitter"
      >
        <indexName>crawler_test</indexName>
        <typeName>test_doc</typeName>
        <queueSize>1</queueSize>
        <commitBatchSize>1</commitBatchSize>
      </committer>

      <startURLs stayOnDomain="true">
        <url>https://biertastisch.de</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

Normal output:

$ ./collector-http.sh -a start -c test.xml
INFO  [AbstractCollectorConfig] Configuration loaded: id=test-collector; logsDir=./logs; progressDir=./progress
INFO  [JobSuite] JEF work directory is: ./progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.4.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.3.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.4.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO  [AbstractCollector] Version: "ElasticsearchCommitter" version is undefined.
INFO  [JobSuite] Running test-crawler: BEGIN (Wed Dec 09 18:11:13 CET 2015)
INFO  [MapDBCrawlDataStore] Initializing reference store ./work/crawlstore/mapdb/test-crawler/
INFO  [MapDBCrawlDataStore] ./work/crawlstore/mapdb/test-crawler/: Done initializing databases.
INFO  [HttpCrawler] test-crawler: RobotsTxt support: true
INFO  [HttpCrawler] test-crawler: RobotsMeta support: true
INFO  [HttpCrawler] test-crawler: Sitemap support: true
INFO  [HttpCrawler] test-crawler: Canonical links support: true
INFO  [HttpCrawler] test-crawler: User-Agent: <None specified>
INFO  [SitemapStore] test-crawler: Initializing sitemap store...
INFO  [SitemapStore] test-crawler: Done initializing sitemap store.
ERROR [StandardSitemapResolver] Could not obtain sitemap: https://biertastisch.de/sitemap.xml.  Expected status code 200, but got 301
ERROR [StandardSitemapResolver] Could not obtain sitemap: https://biertastisch.de/sitemap_index.xml.  Expected status code 200, but got 301
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] test-crawler: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://biertastisch.de
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://biertastisch.de
INFO  [CrawlerEventManager]           REJECTED_FILTER: https://biertastisch.de/info/datenschutz
INFO  [CrawlerEventManager]       REJECTED_ROBOTS_TXT: https://biertastisch.de/info/datenschutz
INFO  [CrawlerEventManager]           REJECTED_FILTER: https://biertastisch.de/info/copyright
INFO  [CrawlerEventManager]       REJECTED_ROBOTS_TXT: https://biertastisch.de/info/copyright
INFO  [CrawlerEventManager]           REJECTED_FILTER: https://biertastisch.de/warenkorb
INFO  [CrawlerEventManager]       REJECTED_ROBOTS_TXT: https://biertastisch.de/warenkorb
INFO  [CrawlerEventManager]           REJECTED_FILTER: https://biertastisch.de/info/agb
INFO  [CrawlerEventManager]       REJECTED_ROBOTS_TXT: https://biertastisch.de/info/agb
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://biertastisch.de
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://biertastisch.de
INFO  [AbstractCommitter] Max queue size reached (1). Committing
INFO  [AbstractFileQueueCommitter] Committing 1 files
INFO  [ElasticsearchCommitter] Sending 1 operations to Elasticsearch.
INFO  [InternalNode] [Scarecrow] version[1.5.0], pid[1496], build[5448160/2015-03-23T14:30:58Z]
INFO  [InternalNode] [Scarecrow] initializing ...
INFO  [PluginsService] [Scarecrow] loaded [], sites []
INFO  [InternalNode] [Scarecrow] initialized
INFO  [InternalNode] [Scarecrow] starting ...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://biertastisch.de/bier-pakete/das-kolner-bucht-paket
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://biertastisch.de/bier-pakete/das-kolner-bucht-paket
INFO  [TransportService] [Scarecrow] bound_address {inet[/0:0:0:0:0:0:0:0:9303]}, publish_address {inet[/192.168.178.27:9303]}
INFO  [DiscoveryService] [Scarecrow] elasticsearch/cpjCKhykSGmGpUAkH-aw2g
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://biertastisch.de/bier-pakete/das-kolner-bucht-paket
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://biertastisch.de/bier-pakete/das-kolner-bucht-paket
INFO  [AbstractCommitter] Max queue size reached (1). Committing
INFO  [AbstractFileQueueCommitter] Committing 1 files
INFO  [ElasticsearchCommitter] Sending 1 operations to Elasticsearch.
INFO  [InternalNode] [Artie] version[1.5.0], pid[1496], build[5448160/2015-03-23T14:30:58Z]
INFO  [InternalNode] [Artie] initializing ...
INFO  [PluginsService] [Artie] loaded [], sites []
INFO  [InternalNode] [Artie] initialized
INFO  [InternalNode] [Artie] starting ...
INFO  [TransportService] [Artie] bound_address {inet[/0:0:0:0:0:0:0:0:9304]}, publish_address {inet[/192.168.178.27:9304]}
INFO  [DiscoveryService] [Artie] elasticsearch/EmR8maD2R3e0jyHtjq2RCw
INFO  [InternalClusterService$UpdateTask] [Scarecrow] detected_master [Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]], added {[Belathauzer][HGPuMZyJTK2oLzLgIXnAKw][ordielite][inet[/192.168.178.27:9302]]{client=true, data=false},[Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]],[Sweetface][X14YYANjSf25y7d5W8wAlg][ordielite][inet[/192.168.178.27:9301]]{client=true, data=false},}, reason: zen-disco-receive(from master [[Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]]])
INFO  [HttpServer] [Scarecrow] bound_address {inet[/0:0:0:0:0:0:0:0:9203]}, publish_address {inet[/192.168.178.27:9203]}
INFO  [InternalNode] [Scarecrow] started
INFO  [InternalClusterService$UpdateTask] [Artie] detected_master [Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]], added {[Belathauzer][HGPuMZyJTK2oLzLgIXnAKw][ordielite][inet[/192.168.178.27:9302]]{client=true, data=false},[Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]],[Sweetface][X14YYANjSf25y7d5W8wAlg][ordielite][inet[/192.168.178.27:9301]]{client=true, data=false},[Scarecrow][cpjCKhykSGmGpUAkH-aw2g][ordielite][inet[/192.168.178.27:9303]]{client=true, data=false},}, reason: zen-disco-receive(from master [[Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]]])
INFO  [InternalClusterService$UpdateTask] [Scarecrow] added {[Artie][EmR8maD2R3e0jyHtjq2RCw][ordielite][inet[/192.168.178.27:9304]]{client=true, data=false},}, reason: zen-disco-receive(from master [[Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]]])
INFO  [HttpServer] [Artie] bound_address {inet[/0:0:0:0:0:0:0:0:9204]}, publish_address {inet[/192.168.178.27:9204]}
INFO  [InternalNode] [Artie] started
INFO  [ElasticsearchCommitter] Done sending operations to Elasticsearch.
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://biertastisch.de/bier-pakete/das-kolner-bucht-paket
INFO  [ElasticsearchCommitter] Done sending operations to Elasticsearch.
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://biertastisch.de
INFO  [AbstractCrawler] test-crawler: 3% completed (2 processed/60 total)
INFO  [AbstractCrawler] test-crawler: Maximum documents reached: 1
INFO  [AbstractCrawler] test-crawler: Maximum documents reached: 1
INFO  [AbstractCrawler] test-crawler: Max documents reached. Not reprocessing orphans (if any).
INFO  [AbstractCrawler] test-crawler: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 1 files
INFO  [ElasticsearchCommitter] Sending 1 operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Done sending operations to Elasticsearch.
INFO  [AbstractCrawler] test-crawler: 2 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] test-crawler: Crawler completed.
INFO  [AbstractCrawler] test-crawler: Crawler executed in 10 seconds.
INFO  [MapDBCrawlDataStore] Closing reference store: ./work/crawlstore/mapdb/test-crawler/
INFO  [JobSuite] Running test-crawler: END (Wed Dec 09 18:11:13 CET 2015)

The much longer debug output can be found here. Note that everything following INFO [JobSuite] Running test-crawler: END (Wed Dec 09 18:24:12 CET 2015) is only printed to the console but not appended to the log file.

Feature request: Specify ES settings and mappings

I have 4 requirements to configure specific aspects of ES:

  1. Set field limit to 2000
  2. Create custom analyzers and tokenizers
  3. Create nested fields
  4. Set specific field properties (e.g., index document.reference with offsets and a custom analyzer)

I currently do all of the above by sending the following settings/mappings to ES prior to starting Norconex jobs:

PUT wmsearch
{
  "settings": {
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "custom": {
          "type": "custom",
          "tokenizer": "custom_token",
          "filter": [
            "lowercase"
          ]
        },
        "custom2": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "custom_token": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 30
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "document": {
          "properties": {
            "reference": {
              "type": "text",
              "index_options": "offsets",
              "analyzer": "custom"
            }
          }
        },
        "scope" : {
          "type" : "nested",
          "properties" : {
            "level" : { 
              "type" : "integer"
            },
            "ancestors" : { 
              "type" : "keyword",
              "index" : "true"
            },
            "value" : { 
              "type" : "keyword",
              "index" : "true"
            },
            "order" : {
              "type" : "integer"
            }    
          }
        }
      }
    }
  }
}

This does the job. However, it would be nice to specify these settings directly in the XML configuration file, perhaps via a committer tag that accepts the settings/mappings above.
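To make the request concrete, here is a purely hypothetical sketch of what such an option could look like. The <indexSettings> element does not exist in the committer today and is invented only for illustration; the index name, type, and mapping are taken from the request above, and the settings are abbreviated (the custom analyzer definition would go under "analysis" as shown in the full request):

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <indexName>wmsearch</indexName>
  <typeName>doc</typeName>
  <!-- Hypothetical element: JSON passed verbatim to PUT /wmsearch when the index is first created -->
  <indexSettings>
    {
      "settings": { "index.mapping.total_fields.limit": 2000 },
      "mappings": { "doc": { "properties": { "document": { "properties": {
        "reference": { "type": "text", "index_options": "offsets", "analyzer": "custom" }
      } } } } }
    }
  </indexSettings>
</committer>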

Thanks for your consideration.

Type name in Elasticsearch sometimes does not match the one specified in the crawler

Hi, I encountered the following problem:
I had 2 crawlers:

  1. The vnexpress crawler commits data into the vnexpress type in Elasticsearch.
  2. The dantri crawler commits data into the dantri type in Elasticsearch.

But sometimes a document from the vnexpress crawler is committed to the dantri type, and vice versa.

I used:

  • Norconex collector http 2.6.0
  • Norconex-committer-elasticsearch-2.1.0

This is my config file:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="Crawler">
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>
    <crawlerDefaults>
        <maxDepth>5</maxDepth>
        <sitemapResolverFactory ignore="true" />
        <delay default="1000" />
        <importer>
            <preParseHandlers>
                <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                    onMatch="include" field="document.contentType">
                (text/html|text/htm)
                </filter>
            </preParseHandlers>
            <postParseHandlers>
              <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                <fields>title,document.reference</fields>
              </tagger>
            </postParseHandlers>
      </importer> 
    </crawlerDefaults>  
  <crawlers>    
    <crawler id="vnexpress">
        <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
            <url>http://vnexpress.net/</url>
        </startURLs>        
        <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
            <indexName>news</indexName>
            <typeName>vnexpress</typeName>  
            <queueSize>1</queueSize>
        </committer>
    </crawler>
    <crawler id="dantri">
        <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
            <url>http://dantri.com.vn/</url>
        </startURLs>        
        <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
            <indexName>news</indexName>
            <typeName>dantri</typeName> 
            <queueSize>1</queueSize>
        </committer>
    </crawler>
  </crawlers>

</httpcollector>

This is the data in Elasticsearch:
![norconex](https://cloud.githubusercontent.com/assets/1876051/18226348/fc89a9e4-7231-11e6-81f3-26643a04eb56.PNG)
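One plausible explanation (an assumption, not confirmed here): both committers fall back to the same default file queue directory, so documents queued by one crawler can be picked up and committed by the other crawler's committer. Giving each committer its own <queueDir> would rule that out, e.g. for the vnexpress crawler (the dantri committer would point to its own directory):

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <indexName>news</indexName>
    <typeName>vnexpress</typeName>
    <!-- Assumption: a distinct queue directory per committer avoids cross-committing -->
    <queueDir>./committer-queue/vnexpress</queueDir>
    <queueSize>1</queueSize>
</committer>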

Nodes configuration does not seem to work: Invalid HTTP host at org.apache.http.HttpHost.create(HttpHost.java:122)

This is my config.xml crawler section:

<crawler id="testsite.com">
	<startURLs>
		<url>http://www.testsite.com</url>
	</startURLs>
	<committer class="${committer}.elasticsearch.ElasticsearchCommitter">
		<nodes>http://elasticsearch:9200/</nodes>
	</committer>
</crawler>

But when running the crawler, it crashes with:
java.lang.IllegalArgumentException: Invalid HTTP host: elasticsearch:9200/

Here is the full log:

norconex         | ERROR [AbstractBatchCommitter] Could not commit batched operations.
norconex         | java.lang.IllegalArgumentException: Invalid HTTP host: elasticsearch:9200/
norconex         | 	at org.apache.http.HttpHost.create(HttpHost.java:122)
norconex         | 	at com.norconex.committer.elasticsearch.ElasticsearchCommitter.createRestClient(ElasticsearchCommitter.java:594)
norconex         | 	at com.norconex.committer.elasticsearch.ElasticsearchCommitter.nullSafeRestClient(ElasticsearchCommitter.java:582)
norconex         | 	at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:418)
norconex         | 	at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
norconex         | 	at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
norconex         | 	at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
norconex         | 	at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:387)
norconex         | 	at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:270)
norconex         | 	at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:226)
norconex         | 	at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
norconex         | 	at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
norconex         | 	at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
norconex         | 	at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
norconex         | 	at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
norconex         | 	at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
norconex         | 	at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
norconex         | 	at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
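The exception comes from HttpHost.create(): with the trailing slash, the port portion it tries to parse ends up as "9200/", which is not a valid number. Removing the slash from <nodes> should avoid the error (same crawler block as above, only the node URL changed):

<crawler id="testsite.com">
	<startURLs>
		<url>http://www.testsite.com</url>
	</startURLs>
	<committer class="${committer}.elasticsearch.ElasticsearchCommitter">
		<!-- No trailing slash after the port -->
		<nodes>http://elasticsearch:9200</nodes>
	</committer>
</crawler>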

ElasticSearch Committer Error

Running the latest 3.0.0-M1 with the Elasticsearch committer 5.0.0-M1.

Per the docs, it seems like typeName should be there: https://opensource.norconex.com/committers/elasticsearch/v4/configuration

But it may have changed with version 5?

      <committers>
        <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
          <nodes>https://searcdfhdhdhdhdhdhdhdhdhhdhdh1.es.amazonaws.com:443</nodes>
          <indexName>docs</indexName>
          <typeName>docs</typeName>
          <targetContentField>fs_content</targetContentField>
          <fixBadIds>true</fixBadIds>
        </committer>
      </committers>

With this config,

./collector-http.sh start -c forescout/docs/docs-config/docs-config.xml 

1 XML configuration errors detected:

[XML] StartCommand: cvc-complex-type.2.4.a: Invalid content was found starting with element 'typeName'. One of '{restrictTo, fieldMappings, queue, ignoreResponseErrors, discoverNodes, dotReplacement, credentials, jsonFieldsPattern, connectionTimeout, socketTimeout, fixBadIds, sourceIdField, targetContentField}' is expected.

If I remove "typeName", it errors when trying to commit to ES:

Caused by: org.elasticsearch.client.ResponseException: method [POST], host [https://search-seservices-qyt22kq34vaaadaz465jecxama.us-east-1.es.amazonaws.com:443], URI [/_bulk], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: type is missing;2: type is missing;3: type is missing;4: type is missing;5: type is missing;6: type is missing;7: type is missing;8: type is missing;9: type is missing;10: type is missing;11: type is missing;12: type is missing;13: type is missing;14: type is missing;15: type is missing;16: type is missing;17: type is missing;18: type is missing;19: type is missing;20: type is missing;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: type is missing;2: type is missing;3: type is missing;4: type is missing;5: type is missing;6: type is missing;7: type is missing;8: type is missing;9: type is missing;10: type is missing;11: type is missing;12: type is missing;13: type is missing;14: type is missing;15: type is missing;16: type is missing;17: type is missing;18: type is missing;19: type is missing;20: type is missing;"},"status":400}
	at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:283) ~[elasticsearch-rest-client-7.8.1.jar:7.8.1]
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:261) ~[elasticsearch-rest-client-7.8.1.jar:7.8.1]
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[elasticsearch-rest-client-7.8.1.jar:7.8.1]
	at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:506) ~[norconex-committer-elasticsearch-5.0.0-M1.jar:5.0.0-M1]
	... 23 more

Using Elasticsearch REST API

From @niels, in Norconex/crawlers#200 (comment):

A couple of weeks ago, I also modified the elasticsearch committer to use ES's REST interface, so I am anyway trying to get more involved in development of the ecosystem. I developed this as a separate committer so that users can choose for themselves whether to use the Node client (of the existing committer) or the REST client (of my new committer).

As I haven't yet gotten around to writing a test suite, I was thus far too hesitant to "officially submit" the new committer to you. But seeing how I am critiquing your "secret" commits, it is only fair to expose my own.. So here it is, if you are interested in taking a peek: https://github.com/herimedia/norconex-committer-elasticsearch-rest.

Caused by: java.io.IOException: listener timeout after waiting for [30000] ms

Hi,

I ran into an issue with what I believe are large files causing listener timeout errors when committing to ES. The ES Java client appears to default to 30 seconds before giving up, which is what I believe is happening. Can you provide a way to increase the timeout, as I am unable to commit the remaining large files to ES? I think this may be the way to increase the timeouts. Here is the error I am receiving:

INFO  [ElasticsearchCommitter] Sending 50 commit operations to Elasticsearch.
ERROR [ElasticsearchCommitter$1] Failure occured on node: "http://localhost:9200". Check node logs.
INFO  [ElasticsearchCommitter] Elasticsearch RestClient closed.
INFO  [AbstractCrawler] WM Search Elastic: Crawler executed in 1 minute 30 seconds.
ERROR [JobSuite] Execution failed for job: WM Search Elastic
com.norconex.committer.core.CommitterException: Could not commit JSON batch to Elasticsearch.
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:489)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:427)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:274)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:228)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.fs.FilesystemCollector.main(FilesystemCollector.java:76)
Caused by: java.io.IOException: listener timeout after waiting for [30000] ms
        at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:660)
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:219)
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:191)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:480)
        ... 15 more
INFO  [JobSuite] Running WM Search Elastic: END (Sun Oct 29 10:52:20 EDT 2017)

On a side note, I was also receiving the errors below, which I was able to circumvent by decreasing the commit batch size from 100 to 50:

WM Search: 2017-10-26 18:57:28 FATAL - WM Search: An error occured that could compromise the stability of the crawler. Stopping excution to avoid further issues...
com.norconex.committer.core.CommitterException: Could not commit JSON batch to Elasticsearch.
  at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:489)
  at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
  at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
  at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
  at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
  at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:427)
  at com.norconex.committer.core.AbstractCommitter.commitIfReady(AbstractCommitter.java:146)
  at com.norconex.committer.core.AbstractCommitter.add(AbstractCommitter.java:97)
  at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:34)
  at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:27)
  at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
  at com.norconex.collector.fs.crawler.FilesystemCrawler.executeCommitterPipeline(FilesystemCrawler.java:243)
  at com.norconex.collector.core.crawler.AbstractCrawler.processImportResponse(AbstractCrawler.java:595)
  at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:541)
  at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
  at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.http.ConnectionClosedException: Connection closed unexpectedly
  at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.closed(HttpAsyncRequestExecutor.java:140)
  at org.apache.http.impl.nio.client.InternalIODispatch.onClosed(InternalIODispatch.java:71)
  at org.apache.http.impl.nio.client.InternalIODispatch.onClosed(InternalIODispatch.java:39)
  at org.apache.http.impl.nio.reactor.AbstractIODispatch.disconnected(AbstractIODispatch.java:100)
  at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionClosed(BaseIOReactor.java:279)
  at org.apache.http.impl.nio.reactor.AbstractIOReactor.processClosedSessions(AbstractIOReactor.java:440)
  at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:283)
  at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
  at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
  ... 1 more
WM Search: 2017-10-26 18:57:28 INFO -          CRAWLER_STOPPING

I am wondering if there is also a bulk setting parameter that can be tweaked to allow large transfers. For reference, here is someone's resolution I found after googling:

I faced the same issue and finally the issue got resolved by the use of request_timeout parameter instead of timeout. 

So the call must be like this helpers.bulk(es,actions,chunk_size=some_value,request_timeout=some_value)
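Note that request_timeout applies to the Python elasticsearch helpers, not to this Java-based committer. A later version of the committer exposes connectionTimeout and socketTimeout configuration elements (they appear in the schema element list quoted in the typeName issue above), which would be the equivalent knobs here. A sketch, assuming a committer version whose schema includes them; the node URL is taken from the log above, while the index name and the millisecond values are illustrative assumptions to verify against that version's documentation:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>http://localhost:9200</nodes>
  <indexName>wmsearch</indexName>
  <!-- Assumed to be milliseconds; check the documentation for your committer version -->
  <connectionTimeout>60000</connectionTimeout>
  <socketTimeout>120000</socketTimeout>
</committer>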

Thanks for your help again :)

Facebook crawler used on events: the Elasticsearch committer commits the data as one array or block

I'm using the Norconex crawler on the Facebook Graph API /events/ endpoint and it crawls the data fine, but when the data is committed to Elasticsearch, Kibana sees it as one block, so it cannot index it.

As far as I know, it should commit each element one by one, but instead it commits them as big arrays of elements, and Kibana cannot identify the fields.

I attach an image to show it (screenshot taken 2017-04-24 at 12:15).

Message: listener timeout after waiting for [30000] ms

I am trying to implement simple CRUD operations on Elasticsearch using Groovy and Grails.
Sometimes I am able to create an index, and sometimes I get the timeout exception below. I have tried many approaches, but none of them work, and I am stuck; can someone help me get past this?
Below the exception I have attached the code I am using; please go through it and check whether it is correct.
Thank you in advance for your help.

Error |
2018-05-29 23:13:18,320 [http-bio-8080-exec-10] ERROR errors.GrailsExceptionResolver - IOException occurred when processing request: [GET] /Sharama1/person/addPerson
listener timeout after waiting for [30000] ms. Stacktrace follows:
Message: listener timeout after waiting for [30000] ms
Line | Method
->> 661 | get in org.elasticsearch.client.RestClient$SyncResponseListener


| 220 | performRequest in org.elasticsearch.client.RestClient
| 192 | performRequest . . . . . . . . . . . in ''
| 428 | performRequest in org.elasticsearch.client.RestHighLevelClient
| 414 | performRequestAndParseEntity . . . . in ''
| 299 | index in ''
| -2 | invoke0 . . . . . . . . . . . . . . . in sun.reflect.NativeMethodAccessorImpl
| 62 | invoke in ''
| 43 | invoke . . . . . . . . . . . . . . . in sun.reflect.DelegatingMethodAccessorImpl
| 497 | invoke in java.lang.reflect.Method
| 1426 | jlrMethodInvoke . . . . . . . . . . . in org.springsource.loaded.ri.ReflectiveInterceptor
| 189 | invoke in org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSite
| 53 | call . . . . . . . . . . . . . . . . in org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite
| 45 | defaultCall in org.codehaus.groovy.runtime.callsite.CallSiteArray
| 108 | call . . . . . . . . . . . . . . . . in org.codehaus.groovy.runtime.callsite.AbstractCallSite
| 116 | call in ''
| 70 | doCall . . . . . . . . . . . . . . . in sharama1.PersonController$_closure3
| -1 | doCall in ''
| -2 | invoke0 . . . . . . . . . . . . . . . in sun.reflect.NativeMethodAccessorImpl
| 62 | invoke in ''
| 43 | invoke . . . . . . . . . . . . . . . in sun.reflect.DelegatingMethodAccessorImpl
| 497 | invoke in java.lang.reflect.Method
| 1426 | jlrMethodInvoke . . . . . . . . . . . in org.springsource.loaded.ri.ReflectiveInterceptor
| 90 | invoke in org.codehaus.groovy.reflection.CachedMethod
| 233 | doMethodInvoke . . . . . . . . . . . in groovy.lang.MetaMethod
| 1086 | invokeMethod in groovy.lang.MetaClassImpl
| 1110 | invokeMethod . . . . . . . . . . . . in groovy.lang.ExpandoMetaClass
| 910 | invokeMethod in groovy.lang.MetaClassImpl
| 411 | call . . . . . . . . . . . . . . . . in groovy.lang.Closure
| -1 | call in sharama1.PersonController$_closure3

=====================
Code for creating an index

def addPerson = {
    // Obtain the shared high-level REST client
    RestHighLevelClient client = ESService.getClient()

    // Document to index
    Map<String, Object> jsonMap = new HashMap<>()
    jsonMap.put("firstName", "abcd")
    jsonMap.put("lastName", "xyz")
    jsonMap.put("date", new Date())
    jsonMap.put("message", "Hugh data Index mapping")

    // Index into index "person1", type "hughdata", id "4"
    IndexRequest indexRequest = new IndexRequest("person1", "hughdata", "4").source(jsonMap)
    IndexResponse res = client.index(indexRequest)

    String index = res.getIndex()
    String type = res.getType()
    String id = res.id
    long version = res.getVersion()
    DocWriteResponse.Result result = res.getResult()

    if (result == DocWriteResponse.Result.CREATED) {
        println("index created = " + result)
    } else if (result == DocWriteResponse.Result.UPDATED) {
        println("index updated = " + result)
    }

    // Returned map becomes the controller response
    ["index": index, "type": type, "id": id, "version": version]
}

=======================
Code for getting the client

class ESService {

    RestHighLevelClient client = null
    //TransportClient client = null

    RestHighLevelClient getClient() {
        try {
            String hostname = "localhost"
            int port = 9200
            String scheme = "http"

            // Build the high-level REST client against the local node
            client = new RestHighLevelClient(RestClient.builder(new HttpHost(hostname, port, scheme)))

            // Quick connectivity check
            boolean pingResponse = client.ping()
            if (pingResponse) {
                print("connection established... " + pingResponse)
            } else {
                print("connection not established. Try again: " + pingResponse)
            }
        } catch (ElasticsearchException e) {
            e.printStackTrace()
        }
        return client
    }
}

committer-elasticsearch compatibility with Elasticsearch v5.x

I have been testing out the HTTP Collector with committer-elasticsearch, but I am getting compatibility issues with Elasticsearch v5.x using either the node or transport client options.

from elasticsearch.log
[2016-12-22T10:05:35,975][WARN ][o.e.t.n.Netty4Transport ] [wJOGjRs] exception caught on transport layer [[id: 0x8b922be6, L:/0:0:0:0:0:0:0:1:9300 - R:/0:0:0:0:0:0:0:1:53549]], closing connection
java.lang.IllegalStateException: Received message from unsupported version: [2.0.0] minimal compatible version is: [5.0.0]

Are there any planned updates to committer-elasticsearch for v5.x compatibility, or any known workarounds to get around this issue?

Committer cannot connect to Elasticsearch

I'm trying to use the committer in conjunction with the AWS Elasticsearch Service. I've configured the AWS instance to grant full access to the IP address used by the machine running the committer, but when the software gets to the point where it tries to commit documents, I get this error:

ERROR [AbstractCrawler] Wiki Crawler: Could not process document: https://wiki.linaro.org/FrontPage (None of the configured nodes are available: [{#transport#-1}{52.55.65.171}{search-websites-uzjmeau3ffjrauoeew5ow3lxkq.us-east-1.es.amazonaws.com/52.55.65.171:9300}])
NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{52.55.65.171}{search-websites-uzjmeau3ffjrauoeew5ow3lxkq.us-east-1.es.amazonaws.com/52.55.65.171:9300}]]
at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:290)
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:207)
at org.elasticsearch.client.transport.support.TransportProxyClient.execute(TransportProxyClient.java:55)
at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:288)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:86)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:56)
at com.norconex.committer.elasticsearch.ElasticsearchCommitter.sendBulkToES(ElasticsearchCommitter.java:329)
at com.norconex.committer.elasticsearch.ElasticsearchCommitter.bulkAddedDocuments(ElasticsearchCommitter.java:288)
at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:257)
at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
at com.norconex.committer.core.AbstractCommitter.commitIfReady(AbstractCommitter.java:146)
at com.norconex.committer.core.AbstractCommitter.add(AbstractCommitter.java:97)
at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:34)
at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:27)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.crawler.HttpCrawler.executeCommitterPipeline(HttpCrawler.java:354)
at com.norconex.collector.core.crawler.AbstractCrawler.processImportResponse(AbstractCrawler.java:549)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:506)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:390)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:771)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I've got the committer configured to use the transport client (and temporarily configured to commit frequently):

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <indexName>webpages</indexName>
    <typeName>webpage</typeName>
    <clusterHosts nodeClient="false">aws-fqdn</clusterHosts>
    <queueSize>1</queueSize>
    <commitBatchSize>1</commitBatchSize>
</committer>

I've tried switching to the node client, but I then get an error about not being able to load mustache. I'm also uncertain about what I have to put into bindIp. If it is the IP address of the Elasticsearch server, AWS actually provides two IP addresses (presumably for load balancing), so I'm not sure how that is supposed to work.
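For what it is worth, the AWS Elasticsearch Service only exposes its REST endpoint over HTTPS (typically port 443) and does not expose the native transport port 9300, so the transport and node clients cannot reach it. Later versions of this committer talk to Elasticsearch through the REST client and take the endpoint via <nodes>; a sketch of that style of configuration, reusing the endpoint, index, and type from this report (assumes a REST-based committer version):

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <!-- The AWS-provided HTTPS endpoint; no transport port needed -->
    <nodes>https://search-websites-uzjmeau3ffjrauoeew5ow3lxkq.us-east-1.es.amazonaws.com:443</nodes>
    <indexName>webpages</indexName>
    <typeName>webpage</typeName>
    <queueSize>1</queueSize>
</committer>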
