elasticsearch-analysis-url's People

Contributors

chirag-velotio, jalaziz, jlinn, mathewmeconry, suzuken, zakird

elasticsearch-analysis-url's Issues

Which version for ES 5.5.2

Hi, which release is for ES version 5.5.2? I tried to install 5.5.1.0 of this plugin and got this error:

Exception in thread "main" java.lang.IllegalArgumentException: plugin [analysis-url] is incompatible with version [5.5.2]; was designed for version [5.5.1]
at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:146)
at org.elasticsearch.plugins.InstallPluginCommand.verify(InstallPluginCommand.java:474)
at org.elasticsearch.plugins.InstallPluginCommand.install(InstallPluginCommand.java:543)
at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:217)
at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:201)
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122)
at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:69)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122)
at org.elasticsearch.cli.Command.main(Command.java:88)
at org.elasticsearch.plugins.PluginCli.main(PluginCli.java:47)
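For reference, plugin builds are tied to an exact Elasticsearch version, so the artifact's version must match the cluster's version digit-for-digit. Assuming the project publishes a release built against 5.5.2 (the URL below is an assumption, not a verified link — check the releases page), the install would look like:

```shell
# Remove any mismatched build first, then install the artifact whose
# elasticsearch.version matches the running cluster exactly (5.5.2 here).
bin/elasticsearch-plugin remove analysis-url
bin/elasticsearch-plugin install \
  https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v5.5.2/elasticsearch-analysis-url-5.5.2.zip
```

If no matching release exists, the only options are to stay on 5.5.1 or build the plugin from source against 5.5.2.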

ES 5.2.x Bulk Indexing Errors

We're running into an issue where we can't seem to bulk index records whose fields are analyzed using this tokenizer. The errors are always of the following form:

[2017-02-21T18:41:16,181][DEBUG][o.e.a.s.TransportSearchAction] [es-frontend-1] [163] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [es-data-9][10.0.118.139:9300][indices:data/read/search[phase/fetch/id]]
Caused by: org.elasticsearch.search.fetch.FetchPhaseExecutionException: Fetch Failed [Failed to highlight field [parsed.extensions.authority_info_access.ocsp_urls]]
	at org.elasticsearch.search.fetch.subphase.highlight.PlainHighlighter.highlight(PlainHighlighter.java:140) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.search.fetch.subphase.highlight.HighlightPhase.hitExecute(HighlightPhase.java:124) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:163) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:489) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.action.search.SearchTransportService$13.messageReceived(SearchTransportService.java:354) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.action.search.SearchTransportService$13.messageReceived(SearchTransportService.java:351) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.2.1.jar:5.2.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: concurrent_modification_exception: null
	at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) ~[?:1.8.0_121]
	at java.util.ArrayList$Itr.next(ArrayList.java:851) ~[?:1.8.0_121]
	at org.elasticsearch.index.analysis.url.URLTokenizer.tokenize(URLTokenizer.java:176) ~[?:?]
	at org.elasticsearch.index.analysis.url.URLTokenizer.incrementToken(URLTokenizer.java:124) ~[?:?]
	at org.elasticsearch.index.analysis.url.URLTokenFilter.tokenize(URLTokenFilter.java:192) ~[?:?]
	at org.elasticsearch.index.analysis.url.URLTokenFilter.advance(URLTokenFilter.java:147) ~[?:?]
	at org.elasticsearch.index.analysis.url.URLTokenFilter.incrementToken(URLTokenFilter.java:122) ~[?:?]
	at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:216) ~[lucene-highlighter-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:44:23]
	at org.elasticsearch.search.fetch.subphase.highlight.PlainHighlighter.highlight(PlainHighlighter.java:125) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.search.fetch.subphase.highlight.HighlightPhase.hitExecute(HighlightPhase.java:124) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:163) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:489) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.action.search.SearchTransportService$13.messageReceived(SearchTransportService.java:354) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.action.search.SearchTransportService$13.messageReceived(SearchTransportService.java:351) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) ~[elasticsearch-5.2.1.jar:5.2.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.2.1.jar:5.2.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_121]

This occurs for seemingly random fields that use this analyzer; it never seems to happen for fields that use other tokenizers. These fields do not have any special highlighting settings.

This is the configuration of the filter and analyzer:

curl -XPUT 'localhost:9200/certificates/_settings' -d '{
  "analysis" : {
    "filter":{
      "URL":{
        "type":"url",
        "part":["protocol", "host", "port", "path", "query", "ref"],
        "url_decode":true,
        "allow_malformed":true,
        "tokenize_malformed":true
      }
    }
  }
}'
curl -XPUT 'localhost:9200/certificates/_settings' -d '{
  "analysis" : {
    "analyzer":{
      "URL":{
        "type":"custom",
        "tokenizer":"whitespace",
        "filter":["URL"]
      }
    }
  }
}'

I'm happy to provide any additional information that's helpful, or to help troubleshoot. Any ideas?
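One way to narrow this down: the stack trace shows a ConcurrentModificationException inside URLTokenizer.tokenize, which points at tokenizer state being mutated while another thread iterates it. A single-threaded _analyze call against the analyzer configured above (sketch below; the sample URL is arbitrary) should succeed if the bug is purely a concurrency issue:

```shell
# Exercise the URL analyzer outside of any search/highlight concurrency.
# If this succeeds while concurrent highlighted searches fail, the problem
# is thread safety in the tokenizer rather than the analysis chain itself.
curl -XGET 'localhost:9200/certificates/_analyze?analyzer=URL&pretty' \
  -d 'http://example.com:8080/foo/bar?baz=qux#ref'
```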

failed to create index - java.lang.ClassNotFoundException: org.elasticsearch.index.analysis.url.UrlTokenizerFactory

I'm trying out version 1.0.0 of this plugin in my test environment (ES 1.4.4), so this may be something that is already fixed in newer versions (I get the same issue with the 2.4.x version of the plugin). It seems that the plugin is loaded, but the UrlTokenizerFactory is not found. Any idea what might be going on?

Caused by: org.elasticsearch.common.settings.NoClassSettingsException: Failed to load class setting [type] with value [url]
	at org.elasticsearch.common.settings.ImmutableSettings.loadClass(ImmutableSettings.java:471)
	at org.elasticsearch.common.settings.ImmutableSettings.getAsClass(ImmutableSettings.java:459)
	at org.elasticsearch.index.analysis.AnalysisModule.configure(AnalysisModule.java:298)
	... 16 more
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.index.analysis.url.UrlTokenizerFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.elasticsearch.common.settings.ImmutableSettings.loadClass(ImmutableSettings.java:469)

Also, unrelated: can I use this plugin to find all documents whose url field has the same host and path but different query parameters?
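To sketch what I mean: assuming url is mapped as a multi-field whose host and path sub-fields use url filters configured with "part": ["host"] and "part": ["path"] respectively (the index and field names here are hypothetical), a bool query over both sub-fields would match regardless of the query string:

```shell
# Matches every document whose url shares this host and path,
# ignoring any query parameters, since neither sub-field indexes them.
curl -XPOST 'localhost:9200/myindex/_search?pretty' -d '{
  "query": {
    "bool": {
      "must": [
        { "term": { "url.host": "example.com" } },
        { "term": { "url.path": "/some/path" } }
      ]
    }
  }
}'
```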

Sorry if this is not the right place to ask these questions. If so, where should I post them?

Cheers,
Carl

failed to create index - java.lang.NoSuchMethodError: 'void org.elasticsearch.index.analysis.AbstractTokenizerFactory

I am trying to migrate my cluster from Elasticsearch 6.8 to Elasticsearch 7.5.1, and I am getting the error message below:

java.lang.NoSuchMethodError: 'void org.elasticsearch.index.analysis.AbstractTokenizerFactory.<init>(org.elasticsearch.index.IndexSettings, java.lang.String, org.elasticsearch.common.settings.Settings)'
	at org.elasticsearch.index.analysis.URLTokenizerFactory.<init>(URLTokenizerFactory.java:28) ~[?:?]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:444) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenizerFactories(AnalysisRegistry.java:285) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:213) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:419) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:553) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:502) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:165) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:502) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:264) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517) ~[elasticsearch-7.5.1.jar:7.5.1]
	at java.lang.Iterable.forEach(Iterable.java:75) ~[?:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) ~[elasticsearch-7.5.1.jar:7.5.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:830) [?:?]

Any idea what might cause this issue?
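The NoSuchMethodError suggests the installed jar was compiled against the 6.8 API: the constructor signature of AbstractTokenizerFactory changed in 7.x, so a 6.8 build cannot load on 7.5.1. One way forward, assuming the project's Maven build accepts an elasticsearch.version property (this is an assumption — check the project's build instructions), is to rebuild against the target version before upgrading:

```shell
# A plugin compiled for one ES version will not load on another.
# Remove the 6.8 build, rebuild against 7.5.1, and install the fresh zip.
bin/elasticsearch-plugin remove analysis-url
git clone https://github.com/jlinn/elasticsearch-analysis-url.git
cd elasticsearch-analysis-url
mvn clean package -Delasticsearch.version=7.5.1   # property name is an assumption
bin/elasticsearch-plugin install \
  file:///path/to/elasticsearch-analysis-url/target/releases/elasticsearch-analysis-url-7.5.1.zip
```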

Cannot install on ES2.0 (or 2.1)

On a fresh install of ES2.0, the command

bin/plugin install analysis-url --url https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.1.0/elasticsearch-analysis-url-2.1.0.zip

does not work anymore. I changed it to

bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.1.0/elasticsearch-analysis-url-2.1.0.zip

which is the correct syntax for ES2.0. But now, ES complains about a missing "plugin-descriptor.properties" file in the plugin (this file is mandatory on ES2.0).
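For reference, ES 2.x refuses any plugin zip that lacks a plugin-descriptor.properties at its root. A v2.1.0 zip built for ES 2.x would need roughly the following (the classname value is an assumption based on the plugin's package naming; the other keys are the standard descriptor fields):

```properties
# plugin-descriptor.properties -- must sit at the root of the plugin zip
description=URL tokenizer and token filter for Elasticsearch
version=2.1.0
name=analysis-url
classname=org.elasticsearch.plugin.analysis.AnalysisURLPlugin
java.version=1.7
elasticsearch.version=2.0.0
```

A zip produced for the ES 1.x plugin layout will not contain this file, which would explain the error.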

Am I doing something wrong here?

Btw, awesome plugin! I had been scratching my head for days about this issue and almost implemented it source-side (doing the "right" tokenization before putting the document in ES), which isn't nearly as elegant as your plugin.

ES 2.4.3?

Could you please build a package for ES 2.4.3?

Unexpected number of tokens from host part

I have followed the readme examples, but when testing locally the last one does not return the same results as shown in the readme:

curl -XPUT 'http://localhost:9200/twitter21/' -d '{
    "settings": {
        "analysis": {
            "filter": {
                "url_host": {
                    "type": "url",
                    "part": ["host"],
                    "url_decode": false
                }
            },
            "analyzer": {
                "url_host": {
                    "filter": ["url_host"],
                    "tokenizer": "whitespace"
                }
            }
        }
    },
    "mappings": {
        "example_type": {
            "properties": {
                "url": {
                    "type": "multi_field",
                    "fields": {
                        "url": {"type": "string"},
                        "host": {"type": "string", "analyzer": "url_host"}
                    }
                }
            }
        }
    }
}'

{"acknowledged":true}

curl 'http://localhost:9200/twitter21/_analyze?analyzer=url_host&pretty' -d 'https://foo.bar.com/baz.html'

{
  "tokens" : [ {
    "token" : "foo.bar.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "bar.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 2
  } ]
}

I was expecting to retrieve just one token, foo.bar.com, instead of three. I also believe the start_offset and end_offset values are wrong.
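If I understand the readme's options correctly (this is an assumption on my part; I may be misreading), the filter tokenizes the host into sub-hosts by default, and a tokenize_host flag controls that. Disabling it should yield only the foo.bar.com token:

```shell
# Same settings as above, but with sub-host tokenization turned off
# (the tokenize_host option name is taken from my reading of the readme).
curl -XPUT 'http://localhost:9200/twitter22/' -d '{
    "settings": {
        "analysis": {
            "filter": {
                "url_host": {
                    "type": "url",
                    "part": ["host"],
                    "url_decode": false,
                    "tokenize_host": false
                }
            },
            "analyzer": {
                "url_host": {
                    "filter": ["url_host"],
                    "tokenizer": "whitespace"
                }
            }
        }
    }
}'
```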

These are the elasticsearch and plugin versions:

./elasticsearch -version
Version: 2.3.4, Build: e455fd0/2016-06-30T11:24:31Z, JVM: 1.8.0_101

 bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.3.4.2/elasticsearch-analysis-url-2.3.4.2.zip

Please correct me if I'm wrong; I hope I'm not missing something else.

Thanks in advance

Mixture of host and path as tokens

Is it possible to get a mixture of host and path as token?

For example, if we have "http://stackoverflow.com/questions/18977834/indexing-website-url-in-elastic-search"

I expect the tokens to be "stackoverflow.com", "stackoverflow.com/questions", "stackoverflow.com/questions/18977834", "stackoverflow.com/questions/18977834/indexing-website-url-in-elastic-search".

I know it's possible using the shingle token filter, but is there anything built into this plugin?
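To make the shingle idea concrete: assuming the url tokenizer supports tokenize_host and tokenize_path options that control sub-host splitting and path-segment splitting (option names taken from my reading of the readme; treat them as assumptions), chaining it with a shingle filter joined on "/" would produce the prefix tokens above, plus some extra mid-path shingles such as questions/18977834:

```shell
# Emit host + path segments, then join adjacent tokens with "/" shingles.
curl -XPUT 'localhost:9200/urls' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "url_parts": {
          "type": "url",
          "part": ["host", "path"],
          "tokenize_host": false,
          "tokenize_path": true
        }
      },
      "filter": {
        "url_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 4,
          "output_unigrams": true,
          "token_separator": "/"
        }
      },
      "analyzer": {
        "url_hierarchy": {
          "tokenizer": "url_parts",
          "filter": ["url_shingle"]
        }
      }
    }
  }
}'
```

Note that shingles built this way include combinations not anchored at the host, so this approximates rather than exactly matches the desired token set.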

Exception with highlighting and 5.3+

When searching via Kibana against ES 5.5, the search fails with an error in Kibana. Inspecting the ES logs reveals an exception caused by this plugin.

Here is an excerpt from the logs:

[2017-08-11T22:51:20,687][DEBUG][o.e.a.s.TransportSearchAction] [s-bf8a53980a053de6-0] [staging-server-logs-logstash-2017.08.11][2], node[Qb-bgiJQSg-qwPiuUUoTDQ], [P], s[STARTED], a[id=7GCztyuMQrmq7JAw3EV9pw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[staging-server-logs-logstash-2017.08.11], indicesOptions=IndicesOptions[id=39, ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='1502491863910', requestCache=null, scroll=null, source={
  "size" : 500,
  "query" : {
    "bool" : {
      "must" : [
        {
          "query_string" : {
            "query" : "test",
            "fields" : [ ],
            "use_dis_max" : true,
            "tie_breaker" : 0.0,
            "default_operator" : "or",
            "auto_generate_phrase_queries" : false,
            "max_determinized_states" : 10000,
            "enable_position_increments" : true,
            "fuzziness" : "AUTO",
            "fuzzy_prefix_length" : 0,
            "fuzzy_max_expansions" : 50,
            "phrase_slop" : 0,
            "analyze_wildcard" : true,
            "escape" : false,
            "split_on_whitespace" : true,
            "boost" : 1.0
          }
        },
        {
          "range" : {
            "@timestamp" : {
              "from" : 1502405480621,
              "to" : 1502491880621,
              "include_lower" : true,
              "include_upper" : true,
              "format" : "epoch_millis",
              "boost" : 1.0
            }
          }
        }
      ],
      "disable_coord" : false,
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "version" : true,
  "_source" : {
    "includes" : [ ],
    "excludes" : [ ]
  },
  "stored_fields" : "*",
  "docvalue_fields" : [
    "@timestamp"
  ],
  "script_fields" : { },
  "sort" : [
    {
      "@timestamp" : {
        "order" : "desc",
        "unmapped_type" : "boolean"
      }
    }
  ],
  "aggregations" : {
    "2" : {
      "date_histogram" : {
        "field" : "@timestamp",
        "time_zone" : "America/Los_Angeles",
        "interval" : "30m",
        "offset" : 0,
        "order" : {
          "_key" : "asc"
        },
        "keyed" : false,
        "min_doc_count" : 1
      }
    }
  },
  "highlight" : {
    "pre_tags" : [
      "@kibana-highlighted-field@"
    ],
    "post_tags" : [
      "@/kibana-highlighted-field@"
    ],
    "fragment_size" : 2147483647,
    "fields" : {
      "*" : {
        "highlight_query" : {
          "bool" : {
            "must" : [
              {
                "query_string" : {
                  "query" : "test",
                  "fields" : [ ],
                  "use_dis_max" : true,
                  "tie_breaker" : 0.0,
                  "default_operator" : "or",
                  "auto_generate_phrase_queries" : false,
                  "max_determinized_states" : 10000,
                  "enable_position_increments" : true,
                  "fuzziness" : "AUTO",
                  "fuzzy_prefix_length" : 0,
                  "fuzzy_max_expansions" : 50,
                  "phrase_slop" : 0,
                  "analyze_wildcard" : true,
                  "escape" : false,
                  "split_on_whitespace" : true,
                  "all_fields" : true,
                  "boost" : 1.0
                }
              },
              {
                "range" : {
                  "@timestamp" : {
                    "from" : 1502405480621,
                    "to" : 1502491880621,
                    "include_lower" : true,
                    "include_upper" : true,
                    "format" : "epoch_millis",
                    "boost" : 1.0
                  }
                }
              }
            ],
            "disable_coord" : false,
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      }
    }
  }
}}] lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [s-bf8a53980a053de6-1][10.244.7.32:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: runtime_exception: Error analyzing query text
	at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:344) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.createFieldQuery(MapperQueryParser.java:845) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.newFieldQuery(QueryParserBase.java:512) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.getFieldQuery(QueryParserBase.java:504) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.getFieldQuerySingle(MapperQueryParser.java:247) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.getFieldQuery(MapperQueryParser.java:166) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:851) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:469) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:355) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:244) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:215) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:111) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.parse(MapperQueryParser.java:824) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery(QueryStringQueryBuilder.java:1036) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:96) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.BoolQueryBuilder.addBooleanClauses(BoolQueryBuilder.java:442) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.BoolQueryBuilder.doToQuery(BoolQueryBuilder.java:416) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:96) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder.transferOptions(HighlightBuilder.java:361) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder.build(HighlightBuilder.java:291) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.parseSource(SearchService.java:686) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:481) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:457) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:253) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.5.1.jar:5.5.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.io.IOException: Malformed URL: test
	at org.elasticsearch.index.analysis.url.URLTokenizer.tokenize(URLTokenizer.java:196) ~[?:?]
	at org.elasticsearch.index.analysis.url.URLTokenizer.incrementToken(URLTokenizer.java:125) ~[?:?]
	at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:91) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:70) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:294) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.createFieldQuery(MapperQueryParser.java:845) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.newFieldQuery(QueryParserBase.java:512) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.getFieldQuery(QueryParserBase.java:504) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.getFieldQuerySingle(MapperQueryParser.java:247) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.getFieldQuery(MapperQueryParser.java:166) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:851) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:469) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:355) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:244) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:215) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:111) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.parse(MapperQueryParser.java:824) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery(QueryStringQueryBuilder.java:1036) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:96) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.BoolQueryBuilder.addBooleanClauses(BoolQueryBuilder.java:442) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.BoolQueryBuilder.doToQuery(BoolQueryBuilder.java:416) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:96) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder.transferOptions(HighlightBuilder.java:361) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder.build(HighlightBuilder.java:291) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.parseSource(SearchService.java:686) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:481) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:457) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:253) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.5.1.jar:5.5.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
Caused by: java.io.IOException: no protocol: test
	at java.net.URL.<init>(URL.java:593) ~[?:1.8.0_131]
	at java.net.URL.<init>(URL.java:490) ~[?:1.8.0_131]
	at java.net.URL.<init>(URL.java:439) ~[?:1.8.0_131]
	at org.elasticsearch.index.analysis.url.URLTokenizer.tokenize(URLTokenizer.java:174) ~[?:?]
	at org.elasticsearch.index.analysis.url.URLTokenizer.incrementToken(URLTokenizer.java:125) ~[?:?]
	at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:91) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:70) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:294) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.createFieldQuery(MapperQueryParser.java:845) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.newFieldQuery(QueryParserBase.java:512) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.getFieldQuery(QueryParserBase.java:504) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.getFieldQuerySingle(MapperQueryParser.java:247) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.getFieldQuery(MapperQueryParser.java:166) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:851) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:469) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:355) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:244) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:215) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:111) ~[lucene-queryparser-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.apache.lucene.queryparser.classic.MapperQueryParser.parse(MapperQueryParser.java:824) ~[elasticsearch-5.5.1.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:30:08]
	at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery(QueryStringQueryBuilder.java:1036) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:96) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.BoolQueryBuilder.addBooleanClauses(BoolQueryBuilder.java:442) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.BoolQueryBuilder.doToQuery(BoolQueryBuilder.java:416) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:96) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder.transferOptions(HighlightBuilder.java:361) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder.build(HighlightBuilder.java:291) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.parseSource(SearchService.java:686) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:481) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:457) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:253) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.5.1.jar:5.5.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]

Searching works fine if highlighting is disabled.

It appears this issue was introduced in 5.3.1 with elastic/elasticsearch#23920.

ElasticSearch 2.3.1 support

Hello, I've tried to install the plugin on ES 2.3.1, but since Elasticsearch enforces strict version matching, the installation failed. Would it be possible to add 2.3.1 support, or to verify whether anything changed between 2.3.0 and 2.3.1? It would also be great if you could publish a SHA1/MD5 checksum, as Elasticsearch complains when one is missing (I realise it's only a warning, but leaving it as-is could pose a risk to the infrastructure our database is part of).

Thanks in advance for any help.

Duplicated "whole" token alongside other types when "tokenize_malformed" is true

I'm not sure whether this is a bug. Wouldn't it be better to keep only the more specific token type?
PUT /url_test/

{
  "settings": {
      "analysis": {
          "analyzer": {
              "url_analyzer": {
                  "filter": "lowercase",
                  "tokenizer": "url_tokenizer",
                  "type": "custom"
              }
          },
          "tokenizer": {
                "url_tokenizer": {
                  "type": "url",
                  "part" : ["whole", "host", "protocol"],
                   "allow_malformed": "true",
                   "tokenize_host": "false",
                   "tokenize_malformed": "true"   <------ allow tokenize_malformed
              }
          }
      }
  },
  "mappings": {
          "url_t": {
              "properties": {
                   "urls": {
                      "type": "string",
                      "analyzer": "url_analyzer",
                      "include_in_all": "true"
                  }

              }
          }
      }
  }

POST url_test/_analyze?analyzer=url_analyzer

{
  "text" : "example.com"
}

result:

{
  "tokens": [
    {
      "token": "example.com",
      "start_offset": 0,
      "end_offset": 10,
      "type": "whole",
      "position": 0
    },
    {
      "token": "example.com",      <---- only this one should remain
      "start_offset": 0,
      "end_offset": 11,
      "type": "host",
      "position": 1
    }
  ]
}
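The deduplication suggested above can be sketched in Python as a hypothetical post-processing step (not part of the plugin): drop any "whole" token whose text also appears under a more specific type, such as "host".

```python
def dedupe_tokens(tokens):
    """Drop 'whole' tokens whose text also appears under a more specific type."""
    specific = {t["token"] for t in tokens if t["type"] != "whole"}
    return [t for t in tokens
            if not (t["type"] == "whole" and t["token"] in specific)]

# Token stream from the example above: "example.com" appears twice.
tokens = [
    {"token": "example.com", "start_offset": 0, "end_offset": 10,
     "type": "whole", "position": 0},
    {"token": "example.com", "start_offset": 0, "end_offset": 11,
     "type": "host", "position": 1},
]
print(dedupe_tokens(tokens))  # only the "host" token remains
```

This keeps the desired behavior while leaving non-duplicated "whole" tokens intact.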

`part` option support "multi"

The `part` option can currently only be set to all parts or to a single part. Supporting multiple parts at once would be more flexible.

PUT /url_test/
{
  "settings": {
      "analysis": {
          "analyzer": {
              "url_analyzer2": {
                  "filter": "lowercase",
                  "tokenizer": "url_tokenizer2",
                  "type": "custom"
              }
          },
          "tokenizer": {
                "url_tokenizer2": {
                  "type": "url",
                  "part" : ["whole", "host"],   <---- just the whole and host parts
                   "allow_malformed": "true",
                   "tokenize_malformed": "true"
              }
          }
      }
  }
}

Currently, the analysis result for "http://foo.bar.com/x.html/query?a=123" is:

{
  "tokens": [
    {
      "token": "a=123",
      "start_offset": 32,
      "end_offset": 37,
      "type": "query",
      "position": 0
    },
    {
      "token": "80",
      "start_offset": 0,
      "end_offset": 0,
      "type": "port",
      "position": 1
    },
    {
      "token": "http://foo.bar.com",
      "start_offset": 0,
      "end_offset": 18,
      "type": "whole",
      "position": 2
    },
    {
      "token": "http",
      "start_offset": 0,
      "end_offset": 4,
      "type": "protocol",
      "position": 3
    },
    {
      "token": "com",
      "start_offset": 15,
      "end_offset": 18,
      "type": "host",
      "position": 4
    },
    {
      "token": "bar.com",
      "start_offset": 11,
      "end_offset": 18,
      "type": "host",
      "position": 5
    },
    {
      "token": "foo.bar.com",
      "start_offset": 7,
      "end_offset": 18,
      "type": "host",
      "position": 6
    },
    {
      "token": "http://foo.bar.com/x.html/query?a=123",
      "start_offset": 0,
      "end_offset": 37,
      "type": "whole",
      "position": 7
    },
    {
      "token": "/x.html",
      "start_offset": 18,
      "end_offset": 25,
      "type": "path",
      "position": 8
    },
    {
      "token": "foo.bar.com:80",
      "start_offset": 0,
      "end_offset": 0,
      "type": "whole",
      "position": 9
    },
    {
      "token": "/x.html/query",
      "start_offset": 18,
      "end_offset": 31,
      "type": "path",
      "position": 10
    }
  ]
}
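Until multi-part support exists, the desired effect can be approximated client-side. A minimal Python sketch (names hypothetical) that filters an analyzer response down to the requested parts:

```python
def filter_parts(tokens, parts):
    """Keep only tokens whose type is in the requested set of URL parts."""
    wanted = set(parts)
    return [t for t in tokens if t["type"] in wanted]

# Abbreviated token stream from the result above.
tokens = [
    {"token": "a=123", "type": "query"},
    {"token": "http", "type": "protocol"},
    {"token": "foo.bar.com", "type": "host"},
    {"token": "/x.html", "type": "path"},
]
print(filter_parts(tokens, ["whole", "host"]))  # keeps only the host token here
```

Doing this in the tokenizer itself would avoid emitting tokens that are immediately discarded.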

Unexpected URL highlight result

elasticsearch-analysis-url version: 5.0.0.0
Elasticsearch version: 5.0.0

Hello,
I am using highlighting on a url-analyzed field, but the result is unexpected and differs each time I search for part of a URL.

Steps to reproduce:

Create index:

PUT analyzer_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "url_tokenizer": {
          "type": "url",
          "url_decode": true,
          "allow_malformed": true,
          "tokenize_malformed": true,
          "tokenize_host": true,
          "tokenize_path": true,
          "tokenize_query": true
        }
      },
      "analyzer": {
        "url_analyzer": {
          "tokenizer": "url_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "url": {
          "type": "text",
          "analyzer": "url_analyzer",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Add Data:

PUT analyzer_test/url/1
{
  "url": "http://up.kamitu.com:8080/cyupdate/ipad/changyong.plist?dev=ios&resolution=1136&ver=82"
}

Search (result is different each time):

POST analyzer_test/_search
{
  "query": {
    "match": {
      "url": "up.kamitu.com:8080"
    }
  },
  "highlight": {
    "pre_tags": [
      "<b>"
    ],
    "post_tags": [
      "</b>"
    ],
    "fields": {
      "*": {}
    }
  }
}

Having issue with index creation and settings on 2.1.1

For anyone receiving an error like this after installing the plugin, make sure to restart your Elasticsearch service :-)

{"error":{"root_cause":[{"type":"index_creation_exception","reason":"failed to create index"}],"type":"illegal_argument_exception","reason":"Unknown Tokenizer type [url] for [url_host]"},"status":400}

add option "ignore_protocol"

Could an "ignore_protocol" option be added to the tokenizer?

The default would be false. If true, a URL without a protocol, like 'foo.bar.com/baz.html', could still be tokenized.
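A sketch of the proposed behavior in Python, assuming the tokenizer could fall back to a default scheme before parsing (this is not how the plugin currently works):

```python
from urllib.parse import urlparse

def parse_with_default_scheme(url, default="http"):
    """Prepend a scheme when the URL has none, so parsing still yields host and path."""
    if "://" not in url:
        url = default + "://" + url
    return urlparse(url)

parts = parse_with_default_scheme("foo.bar.com/baz.html")
print(parts.netloc, parts.path)  # foo.bar.com /baz.html
```

The same trick could be applied inside the tokenizer before handing the string to java.net.URL, which otherwise throws "no protocol".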

file:/// url exception

I'm on ES 1.7. Things blow up on a file:/// URI.

$ curl -XPUT bam -d '
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "url_full": {
          "type": "url",
          "allow_malformed": true
        }
      },
      "analyzer": {
        "url_full": {
          "tokenizer": "url_full"
        }
      }
    }
  },
  "mappings": {
    "events": {
      "properties": {
       "url": {
        "type": "string",
        "analyzer": "url_full"
      } 
      }
    }
  }
}'

# blows up on file:///
$ curl 'localhost:9200/bam/_analyze?analyzer=url_full&pretty' -d 'file:///escape/path/to/dream'
{
  "error" : "IllegalArgumentException[startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-1,endOffset=6]",
  "status" : 500
}

# fine for file://
$ curl 'localhost:9200/logs/_analyze?analyzer=url_full&pretty' -d 'file://operabar/escape-two'
{
  "tokens" : [ {
    "token" : "file",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "protocol",
    "position" : 1
  }, {
    "token" : "operabar",
    "start_offset" : 7,
    "end_offset" : 15,
    "type" : "host",
    "position" : 2
  }, {
    "token" : "/escape-two",
    "start_offset" : 15,
    "end_offset" : 26,
    "type" : "path",
    "position" : 3
  }, {
    "token" : "-1",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "port",
    "position" : 4
  }, {
    "token" : "file://operabar/escape-two",
    "start_offset" : 0,
    "end_offset" : 26,
    "type" : "whole",
    "position" : 5
  }, {
    "token" : "operabar:-1",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "whole",
    "position" : 6
  }, {
    "token" : "file://operabar",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "whole",
    "position" : 7
  } ]
}
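For comparison, Python's urlparse treats the host of a file:/// URL as empty, which hints at why the tokenizer ends up with a -1 start offset for a host component that is not there (illustrative only; the plugin uses java.net.URL):

```python
from urllib.parse import urlparse

# file:/// has an empty authority, so there is no host substring to offset into.
parts = urlparse("file:///escape/path/to/dream")
print(repr(parts.netloc), parts.path)  # '' /escape/path/to/dream
```

A fix would presumably need to skip host-derived tokens (and the port) when the authority is empty, rather than emitting tokens with negative offsets.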

Enhance the token filter with "pass through" semantics

The token filter provided by this plugin is super useful on fields which only contain an URL. It would be even more useful if we could use it to analyze fields containing arbitrary text possibly with URLs in them (think user comments).

I suggest to add another option called pass_through to the URL Token Filter. If this option is set to true, the token filter would just pass the token through without any modifications instead of just rejecting the document. And allow_malformed would need to be true, I guess.

This would allow to use the url token filter and analyze a free-text user comment with URLs in them. For instance: the text bla bla http://www.google.com bla bla would be tokenized into (using whitespace tokenizer):

  • bla (not a URL, just pass through)
  • bla (not a URL, just pass through)
  • www.google.com
  • bla (not a URL, just pass through)
  • bla (not a URL, just pass through)

Even better would be to also have the option to set tokenize_host on the token filter, so that the host would be further tokenized.

Currently, if the url token filter is set up with part: whole and allow_malformed: true, it will stop tokenizing after the first token bla in my example above. The new option I'm suggesting would just tell the token filter to keep going with the other tokens.
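A minimal Python sketch of the suggested pass-through semantics (hypothetical behavior, not the plugin's current code): each whitespace token is parsed, tokens that look like URLs are reduced to their host, and everything else is passed through unchanged.

```python
from urllib.parse import urlparse

def pass_through_filter(text):
    out = []
    for token in text.split():
        parts = urlparse(token)
        if parts.scheme and parts.netloc:   # looks like a URL
            out.append(parts.netloc)        # emit the host part
        else:
            out.append(token)               # not a URL, just pass through
    return out

print(pass_through_filter("bla bla http://www.google.com bla bla"))
# ['bla', 'bla', 'www.google.com', 'bla', 'bla']
```

The real token filter would work on the incoming token stream rather than raw text, but the branch "URL → extract part, otherwise → pass through" is the essence of the proposal.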

Seems to fail on urls without paths

Trying to index something like https://www.google.com throws this error:

java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:854)
    at org.elasticsearch.index.analysis.url.URLTokenFilter.incrementToken(URLTokenFilter.java:101)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:634)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:273)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:413)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1318)
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1297)
    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:528)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:119)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

with these analyzer settings

"analysis": {
        "filter": {
          "url_path": {
            "tokenize_path": "false",
            "type": "url",
            "allow_malformed": "true",
            "tokenize_query": "false",
            "part": "path"
          }
        },
        "analyzer": {
          "url_host": {
            "tokenizer": "url_host"
          },
          "url_path": {
            "filter": [
              "url_path"
            ],
            "tokenizer": "whitespace"
          }
        },
        "tokenizer": {
          "url_host": {
            "tokenize_path": "false",
            "type": "url",
            "allow_malformed": "true",
            "tokenize_query": "false",
            "part": "host"
          }
        }
      }

The error seems to have started somewhere between the 1.7.x-compatible version and the 2.x-compatible version.

Querystring parameters as key:value?

Hello,

I got this tokenizer working in ES 5.3 just by changing the version. But I'm wondering: is there a way to get the tokenizer to split query parameters like ?utm_source=twitter into key:value tokens, e.g. utm_source:twitter?

I didn't see anything in the options that would help with that.
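The plugin does not currently expose such an option, but the desired key:value tokens can be produced client-side before indexing. A sketch using Python's parse_qsl:

```python
from urllib.parse import urlparse, parse_qsl

def query_kv_tokens(url):
    """Turn each query-string parameter into a 'key:value' token."""
    query = urlparse(url).query
    return [f"{k}:{v}" for k, v in parse_qsl(query)]

print(query_kv_tokens("http://example.com/page?utm_source=twitter&utm_medium=social"))
# ['utm_source:twitter', 'utm_medium:social']
```

Inside the plugin, the equivalent would be splitting the query part on '&' and replacing '=' with ':' per pair.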

ES 5.0?

Any plans on supporting Elasticsearch 5?

fail to analyze relative protocol

I'm on ES 1.7. A protocol-relative URI fails to be tokenized.

$ curl -XPUT bam -d '
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "url_full": {
          "type": "url",
          "allow_malformed": true
        }
      },
      "analyzer": {
        "url_full": {
          "tokenizer": "url_full"
        }
      }
    }
  },
  "mappings": {
    "events": {
      "properties": {
       "url": {
        "type": "string",
        "analyzer": "url_full"
      } 
      }
    }
  }
}'

# fails to tokenize //xm.as/santa/claus/
$ curl 'localhost:9200/logs/_analyze?analyzer=url_full&pretty' -d '//xm.as/santa/claus/'
{
  "tokens" : [ {
    "token" : "//xm.as/santa/claus/",
    "start_offset" : 0,
    "end_offset" : 19,
    "type" : "whole",
    "position" : 1
  } ]
}
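java.net.URL cannot parse a protocol-relative reference without a base URL, whereas such a reference can be handled by supplying a scheme first. A Python sketch of the idea (illustrative, not the plugin's code):

```python
from urllib.parse import urlparse

def parse_protocol_relative(url, default_scheme="http"):
    """Resolve a protocol-relative URL (//host/path) against a default scheme."""
    if url.startswith("//"):
        url = default_scheme + ":" + url
    return urlparse(url)

parts = parse_protocol_relative("//xm.as/santa/claus/")
print(parts.netloc, parts.path)  # xm.as /santa/claus/
```

The tokenizer could apply the same normalization before parsing, so protocol-relative input yields host and path tokens instead of a single "whole" token.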
