
mod_elasticsearch's Introduction

mod_elasticsearch


This Zotonic module gives you more relevant search results by making resources searchable through
Elasticsearch.

Note

This module targets Elasticsearch 5.x, which still has support for mapping types.

For Elasticsearch 7 and later, use https://github.com/driebit/mod_elasticsearch2 instead.

Installation

mod_elasticsearch acts as a bridge between Zotonic and the tsloughter/erlastic_search Erlang library, so install that and its dependencies first by adding them to deps in zotonic.config:

{deps, [
    %% ...
    {erlastic_search, ".*", {git, "https://github.com/tsloughter/erlastic_search.git", {branch, "master"}}},
    {hackney, ".*", {git, "https://github.com/benoitc/hackney.git", {tag, "1.6.1"}}},
    {jsx, ".*", {git, "https://github.com/talentdeficit/jsx.git", {tag, "2.8.0"}}}      
]}

Configuration

To configure the Elasticsearch host and port, edit your erlang.config file:

[
    %% ...
    {erlastic_search, [
        {host, <<"elasticsearch">>}, %% Defaults to 127.0.0.1
        {port, 9200}                 %% Defaults to 9200
    ]},
    %% ...
].

Search queries

When mod_elasticsearch is enabled, it will direct all search operations of the ‘query’ type to Elasticsearch:

z_search:search({query, Args}, Context).

For Args, you can pass all regular Zotonic query arguments, such as:

z_search:search({query, [{hasobject, 507}]}, Context).

Query context filters

The filter search argument that you know from Zotonic is applied in Elasticsearch’s filter context, where it does not influence the score. To add filters that do influence the score (ranking), use query_context_filter instead. The syntax is identical to that of filter:

z_search:search({query, [{query_context_filter, [["some_field", "value"]]}]}, Context).

Extra query arguments

This module adds some extra query arguments on top of Zotonic’s default ones.

To find documents that have a field, whatever its value (make sure to pass exists as an atom):

{filter, [<<"some_field">>, exists]}

To find documents that do not have a field (make sure to pass missing as an atom):

{filter, [<<"some_field">>, missing]}

For a match phrase prefix query, use the prefix argument:

z_search:search({query, [{prefix, <<"Match this pref">>}]}, Context).

To exclude a document:

{exclude_document, [Type, Id]}
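
For example, to keep the current resource out of its own ‘related items’ search (the <<"resource">> document type and the Id variable are assumptions; use the type your documents are actually indexed under):

z_search:search(
    {query, [
        {text, "related articles"},
        {exclude_document, [<<"resource">>, Id]}
    ]},
    Context
).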

To supply a custom function_score clause, pass one or more score_function arguments. For instance, to rank recent articles above older ones:

z_search:search(
    {query, [
        {text, "Search this"},
        {score_function, #{
            <<"filter">> => [{cat, "article"}],
            <<"exp">> => #{
                <<"publication_start">> => #{
                    <<"scale">> => <<"30d">>
                }
            }
        }}
    ]},
    Context
).

Notifications

elasticsearch_fields

Observe this foldr notification to change the document fields that are queried. You can use Elasticsearch multi_match syntax for boosting fields:

%% your_site.erl

-export([
    % ...
    observe_elasticsearch_fields/3
]).

observe_elasticsearch_fields({elasticsearch_fields, QueryText}, Fields, Context) ->
    %% QueryText is the search query text

    %% Add or remove fields: 
    [<<"some_field">>, <<"boosted_field^2">>|Fields].   

elasticsearch_put

Observe this notification to change the resource properties before they are stored in Elasticsearch. For instance, to store their zodiac sign alongside person resources:

%% your_site.erl

-include_lib("mod_elasticsearch/include/elasticsearch.hrl").

-export([
    % ...
    observe_elasticsearch_put/3
]).

-spec observe_elasticsearch_put(#elasticsearch_put{}, map(), z:context()) -> map().
observe_elasticsearch_put(#elasticsearch_put{id = Id}, Props, Context) ->
    case m_rsc:is_a(Id, person, Context) of
        true ->
            Props#{zodiac => calculate_zodiac(Id, Context)};
        false ->
            Props
    end.

Logging

By default, mod_elasticsearch logs outgoing queries at the debug log level. To see them in your Zotonic console, change the minimum log level to debug:

lager:set_loglevel(lager_console_backend, debug).

How resources are indexed

Content in all languages is stored in the index, following the one language per field strategy:

Each translation is stored in a separate field, which is analyzed according to the language it contains. At query time, the user’s language is used to boost fields in that particular language.
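
As a sketch only (the field names below are illustrative, not the literal mapping), a translated title ends up in the index roughly like this, with each field analyzed for its own language:

#{
    <<"title_en">> => <<"The title in English">>,
    <<"title_nl">> => <<"De titel in het Nederlands">>
}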

mod_elasticsearch's People

Contributors

ddeboer, dirklectisch, doriend, githubrow, linuss, loetie, mworrell, rl-king, robvandenbogaard, row-b

mod_elasticsearch's Issues

Support cat_exact

We currently store all categories a resource belongs to. Elasticsearch does not support searching in only the last value (which in our case is the most specific category, which we need for cat_exact), so we need to store the most specific category separately.

/cc @DorienD
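
Until then, a rough workaround is possible with the existing elasticsearch_put notification; the cat_exact field name and the use of the category_id property are assumptions, not part of the module:

%% your_site.erl (sketch)
observe_elasticsearch_put(#elasticsearch_put{id = Id}, Props, Context) ->
    %% category_id is the resource's direct, most specific category
    Props#{cat_exact => m_rsc:p(Id, category_id, Context)}.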

Convert all is_* properties to boolean

As long as booleans come in, that’s fine. Unfortunately, devs sometimes use 0/1 for false/true, which will (correctly) be interpreted as a numeric field by Elasticsearch. Just do a z_convert:to_bool to enforce boolean.
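
A minimal sketch of that coercion in an elasticsearch_put observer, assuming the property keys are binaries (adapt the match if they are atoms):

observe_elasticsearch_put(#elasticsearch_put{}, Props, _Context) ->
    maps:map(
        fun(<<"is_", _/binary>>, Value) -> z_convert:to_bool(Value);
           (_Key, Value) -> Value
        end,
        Props
    ).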

Mapping upgrades

In general, existing field mappings cannot be updated.

To be able to change mappings (both dynamic mappings and explicit mappings) the index needs to be recreated. Offer a way to ‘upgrade’ an index when manage_schema contains a mapping change:

  1. version the index
  2. change the current index name to an alias that points to the latest version
  3. when a mapping change occurs, apply it to index{current_version+1}, reindex the data using the reindex API
  4. switch the alias when reindexing has finished (reindex is synchronous).

Fall back to PostgreSQL if Elasticsearch is unreachable

If Elasticsearch is unreachable, fall back to the PostgreSQL data source. Let’s try to differentiate between:

  • unreachable Elasticsearch: perhaps retry once, then fall back to PostgreSQL
  • error in Elasticsearch query: do not fall back but show error page (so devs/editors know they made an error in the template or search query resource).

Add status button to re-install the Elastic index

If a document is inserted without first creating the index and installing the mappings, then the default mappings are used.

These default mappings are wrong, especially for related objects.

If such a wrong index exists, or if mappings are changed, then we need a simple method to re-install the index with the new mappings.

The proposal is to add a button to /admin/status that drops the index, re-creates it, and starts a re-pivot of all resources.

Denormalise related resources

Currently this module leaves out related resources (Zotonic includes their titles to find resources by their relations). Using the pivot templates in Zotonic 1.0, we can include them again. Remember that we already have the elasticsearch_put notification for really customised inclusion of related resource terms.

Optimise mappings

This module already has some good mappings, but we may be able to improve upon them. For inspiration, cf. the mappings used in mod_search_solr. A sketch of some of these tweaks follows the list below.

  • postal codes not_analyzed
  • don’t index summary_html, body_html etc.
  • store less data, as we only need to return the ids (and some facets). Fields are not stored by default, so just add to ignored_props to get rid of them in _source.
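
As a sketch of the first two points, expressed as explicit field mappings in an Erlang map (field names are illustrative; in Elasticsearch 5.x the keyword type takes the place of not_analyzed strings):

#{
    <<"address_postcode">> => #{<<"type">> => <<"keyword">>},
    <<"summary_html">> => #{<<"type">> => <<"text">>, <<"index">> => false},
    <<"body_html">> => #{<<"type">> => <<"text">>, <<"index">> => false}
}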

Use separate Hackney pools for reads and writes

A large number of writes (e.g. when indexing a large dataset) can cause the Hackney pool used by erlastic_search to become congested. Raising the maximum number of connections for that pool helps but is no definitive solution, because writes still contend with reads.

Therefore, it’s better to use two separate pools: one for all reads (GET/HEAD requests) and another for all indexing operations (PUT/POST/DELETE).
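
A minimal sketch of what that could look like (pool names and sizes are made up, and erlastic_search/hackney would still need to be told which pool to use per request):

%% Somewhere during site start-up (sketch):
hackney_pool:start_pool(elastic_read, [{max_connections, 100}]),
hackney_pool:start_pool(elastic_write, [{max_connections, 25}]),
%% reads would then pass [{pool, elastic_read}] in the hackney options,
%% writes [{pool, elastic_write}].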

Boost specific categories

Elasticsearch has an elegant way to prefer resources in specific categories (using should with an optional boost):

GET yoursite/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "category": {
              "query": "person",
              "boost": 2
            }
          }
        }
      ],
      "must": [
        {
          "multi_match": {
            "query": "james bond",
            "fields": [
              "title_*"
            ]
          }
        }
      ]
    }
  }
}

Zotonic only supports must match for a category. So we need to either:

  • add custom m.search properties that only mod_elasticsearch supports (e.g., prefer_cat[s]). Disadvantage: the extended query breaks when mod_elasticsearch is disabled and Zotonic falls back to the default PostgreSQL search.
  • or add a custom m.elasticsearch model to make it very explicit that the search query depends on mod_elasticsearch. Disadvantage: requires close coupling between templates and the search engine.

Fix Erlang 18 compatibility

ERROR: OTP release 18 does not match required regex R15|R16|17
ERROR: compile failed while processing /opt/zotonic/deps/idna: rebar_abort
GNUmakefile:56: recipe for target 'compile' failed
make: *** [compile] Error 1

tsloughter/erlastic_search depends on hackney, which depends on idna.
