
Introduction

A modular, open-source search engine for our world.

Pelias is a geocoder powered completely by open data, available freely to everyone.

Local Installation · Cloud Webservice · Documentation · Community Chat

What is Pelias?
Pelias is a search engine for places worldwide, powered by open data. It turns addresses and place names into geographic coordinates, and turns geographic coordinates into places and addresses. With Pelias, you’re able to turn your users’ place searches into actionable geodata and transform your geodata into real places.

We think open data, open source, and open strategy win over proprietary solutions at any part of the stack and we want to ensure the services we offer are in line with that vision. We believe that an open geocoder improves over the long-term only if the community can incorporate truly representative local knowledge.

Pelias Elasticsearch Schema Definition

This package defines the Elasticsearch schema used by Pelias. Pelias requires quite a few settings for performance and accuracy. This repository contains those settings as well as useful tools to ensure they are applied correctly.

Requirements

See Pelias Software requirements for general Pelias requirements.

Installation

$ npm install pelias-schema

Usage

create index

./bin/create_index                          # quick start

drop index

node scripts/drop_index.js                 # drop everything
node scripts/drop_index.js --force-yes     # skip warning prompt

update settings on an existing index

This is useful when you want to add a new analyser or filter to an existing index.

Note: it is not possible to change the number_of_shards of an existing index; doing so requires a full re-index.

node scripts/update_settings.js          # update index settings

output schema file

Use this script to pretty-print the schema's mappings to stdout.

node scripts/output_mapping.js

check all mandatory elasticsearch plugins are correctly installed

Print a list of which plugins are installed and how to install any that are missing.

node scripts/check_plugins.js

Configuration

Settings from pelias.json

Like the rest of Pelias, the Pelias schema can be configured through a pelias.json file read by pelias-config.

schema.indexName

This allows configuring the name of the index created in Elasticsearch. The default is pelias.

Note: All Pelias importers also use this configuration value to determine what index to write to. Additionally, the Pelias API uses the related api.indexName parameter to determine where to read from.
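For example, a pelias.json overriding the index name for both the schema tools and the API might look like the following (the index name pelias-prod is illustrative):

```json
{
  "schema": {
    "indexName": "pelias-prod"
  },
  "api": {
    "indexName": "pelias-prod"
  }
}
```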

user customizable synonyms files

You may provide your own custom synonyms by editing files in the ./synonyms/ directory.

$ ls -1 synonyms/custom_*
synonyms/custom_admin.txt
synonyms/custom_name.txt
synonyms/custom_street.txt

You must edit the files before running create_index.js; any later changes to the files require you to drop and recreate the index before those synonyms become available.

Synonyms are only used at index time. The filename contains the name of the elasticsearch field to which the synonyms apply, i.e. custom_name applies to the name.* fields, custom_street applies to the address_parts.name field, and custom_admin applies to the parent.* fields.

see: #273 for more info.

With great power comes great responsibility. Synonyms files are often used as a hammer when a scalpel is required. Please take care with their use and make maintainers aware that you are using custom synonyms when you open support tickets.
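As a sketch, synonyms files commonly follow the Solr/Elasticsearch synonym format; the entries below are purely illustrative (check the existing files in ./synonyms/ for the exact conventions used by this repository before editing):

```text
# comma-separated tokens are treated as equivalent
pkwy, parkway

# an arrow rewrites the left-hand tokens to the right-hand token at index time
saint => st
```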

NPM Module

The pelias-schema npm module can be found here:

https://npmjs.org/package/pelias-schema

You can pull down a versioned copy of the pelias schema from npm:

var schema = require('pelias-schema');

console.log( JSON.stringify( schema, null, 2 ) );

Contributing

Please fork and pull request against upstream master on a feature branch.

Pretty please: provide unit tests and script fixtures in the test directory.

Running Unit Tests

$ npm test

Running Integration Tests

Requires a running elasticsearch server (no other setup required)

$ npm run integration

Running elasticsearch in Docker (for testing purposes)

Download the image and start an elasticsearch docker container:

$ docker run --rm --name elastic-test -p 9200:9200 pelias/elasticsearch:7.5.1

Continuous Integration

CI tests every release against all supported Node.js versions.

Contributors

avulfson17, blackmad, bradh, dianashk, echelon9, fdansv, greenkeeper[bot], greenkeeperio-bot, heffergm, hkrishna, joxit, michaelkirk, missinglink, nickstallman, orangejulius, pushkar-geospoc, sevko, sweco-semhul, taygun, tigerlily-he, trescube


Issues

Re think admin field analyzer (no tokenizer necessary)

admin fields don't need to be tokenized: think New York and Great Britain. We don't want to match New Jersey when the query is for New York just because the whitespace tokenizer split New York into New and York.

However, we need more flexibility than the keyword analyzer offers: string matching, fuzzy matching with a distance of 1 or 2, case sensitivity, etc., and most of all the ability to search new york, ny and partially match the admin2 field new york and the admin1_abbr field ny.

phrase slop

terms Lake Cayuga, ny and Cayuga Lake, ny should both return Cayuga Lake, Tompkins County, NY

Explore using not_analyzed for source

The source field currently uses the "keyword" analyzer, which basically keeps the full string as a single token with no changes. According to the keyword analyzer docs, it might make more sense to use the "not_analyzed" setting, briefly touched on here in the docs. It seems like it might do the same thing while somehow being faster.

refresh interval

Set the refresh_interval of the pelias index directly when creating the schema:

settings.js:

    },
    "index": {
      "number_of_replicas": "0",
      "number_of_shards": "1",
      "refresh_interval": "1m",

Allow override in the usual fashion via PELIAS_CONFIG.

expand umlaut, eg. 'ö' -> 'oe'

it's common for users speaking germanic languages to expand ö to oe, especially when stuck using an English keyboard.

an example of this is Löhningen which we can currently surface using Löhningen and the asciifolded version Lohningen but not by the term Loehningen.

other vowels are also affected; these characters are called "umlauts" in German. There may be an existing Lucene analyser for this which we can use.

ref: https://en.wikipedia.org/wiki/%C3%96
thanks @ralphiech for reporting
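One possible approach (an untested sketch, not something currently in the schema) is Elasticsearch's built-in mapping character filter, which can expand umlauts to their two-letter forms before tokenization:

```json
{
  "analysis": {
    "char_filter": {
      "umlaut_expansion": {
        "type": "mapping",
        "mappings": [ "ä => ae", "ö => oe", "ü => ue", "ß => ss" ]
      }
    }
  }
}
```

With such a filter in the analysis chain, Löhningen would also be indexed as Loehningen; the interaction with the existing asciifolding step (which produces Lohningen) would need to be checked, since filter ordering matters.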

hyphenated names

names such as 51 Friedrich-Richter-Straße (address-osmnode-2967205513) should be searchable using the tokens ['friedrich','richter','strasse'] as well as ['friedrichrichterstrasse'] and ['friedrich-richter-strasse']

rethink `address_stop` filter

as per pelias/api#357, if a place name is composed entirely of stop words, it results in 0 ngrams being inserted into the index.

normally the ngram analyser would turn 1 aa street -> [ "1", "a", "aa" ] (removing the 'street'), we do this because otherwise the inverted index for st, rd, roa, avenu etc would be huge and result in slow queries.

in the case where the entire name is composed of stop words it ends up stripping all of the tokens and so we are left with: avenue lane -> [] or vista center -> [].

In combination with a must match condition, this means 0 results are returned:

{
   "query": {
      "match": {
         "name.default": {
            "analyzer": "peliasTwoEdgeGram",
            "query": "street avenue road"
         }
      }
   }
}

https://github.com/pelias/schema/blob/master/settings.js#L104
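The failure mode described above can be sketched in a few lines of JavaScript (the stop-word list here is illustrative, not the actual address_stop contents):

```javascript
// Illustrative subset of street-suffix stop words; NOT the real address_stop list.
const STREET_STOP_WORDS = new Set([
  'street', 'st', 'road', 'rd', 'avenue', 'ave', 'lane', 'vista', 'center'
]);

// Remove stop words from a name, as the ngram analysis chain does.
function stripStopWords(name) {
  return name
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => !STREET_STOP_WORDS.has(token));
}

console.log(stripStopWords('1 aa street')); // [ '1', 'aa' ]
console.log(stripStopWords('avenue lane')); // [] -- nothing left to index
```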

house numbers vs. punctuation

house numbers such as 8/47 are matching the token 847 due to the punctuation being removed and the numbers being concatenated.

tweak boundary mapping per admin-level

All admin-level boundary types currently use the boundary partial mapping, which incurs the following problems:

  • the admin0 type contains admin1 and admin2 values and the admin1 type contains an admin2 value, which is incorrect
  • higher-level admin polygons, like neighborhood, don't have locality or local_admin names

Since different admin-levels have slightly different schemas, we probably won't be able to get away with just one mapping for them all. While we're updating these, we should get rid of the gn_id and woe_id attributes since they're not getting used for anything. I ran into problems with this when implementing pelias-deprecated/quattroshapes#21 (since the importer can't set locality/local_admin values for neighborhood polys).

Increase max gram size

Max gram size should be increased to accommodate very long hyphenated names such as Saint-Dié-des-Vosges; a setting of ~30 would be good.
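As a sketch, this would mean raising max_gram on the edge-ngram token filter in settings.js; the filter name and other values below are illustrative, not a copy of the real settings:

```json
{
  "filter": {
    "peliasTwoEdgeGramFilter": {
      "type": "edge_ngram",
      "min_gram": 2,
      "max_gram": 30
    }
  }
}
```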

non-english characters

our current analyser strategy doesn't seem to account for using non-accented alternatives of non-english characters

http://pelias.mapzen.com/search?input=cinématte&lat=46.961&lon=7.461

vs

http://pelias.mapzen.com/search?input=cinematte&lat=46.961&lon=7.461

@fdansv is there an easy fix for this?

refresh_interval

We used to set a very high refresh_interval but somewhere along the way the setting seems to have been removed.

I'm assuming we can get a performance improvement by reducing refresh to something like 1m instead of 1s, which would mean newly indexed docs wouldn't appear in the search for at most 1m but would mean ES isn't creating new segments all the time.

Additionally this may save RAM, as we would be building fewer inverted indexes and fewer FSTs; plus, the commit point log would be 60x smaller, resulting in faster startup times on larger indexes.

Create index script fails on newer elasticsearch v2.2

I tried running it against the latest elasticsearch v2.2 and it fails with this error:

[put mapping] pelias { [Error: [mapper_parsing_exception] analyzer on field [neighbourhood_id] must be set when search_analyzer is set]
status: '400',
message: '[mapper_parsing_exception] analyzer on field [neighbourhood_id] must be set when search_analyzer is set',
path: '/pelias',
query: {},
...

Clean up scripts

As a fairly low-priority task, it'd be nice to move the tools in the /scripts dir to use something like https://www.npmjs.org/package/commander since they are becoming increasingly dependent on input args and flags; additionally it would be nice to have a --help screen.

Allow override of default database name via pelias config file

The solution to the schema change scenario that came up yesterday in chat: allow the ability to change the database name that the API uses to issue requests to the backend via the pelias config file (currently hardcoded(?) as ‘pelias’).

This would allow a manual rollout when there's a breaking schema change: the live API would continue to point at the correct index until everything is in place to move to the new one.

Index creation errors

When referencing a config file with multiple elasticsearch hosts, create_index exits with a non-zero status because the index has already been created:

---- Begin output of node scripts/create_index.js ----
STDOUT: [put mapping] pelias { [Error: RemoteTransportException[[elasticsearch5.localdomain][inet[/10.0.2.100:9300]][indices/create]]; nested: IndexAlreadyExistsException[[pelias] already exists]; ]
message: 'RemoteTransportException[[elasticsearch5.localdomain][inet[/10.0.2.100:9300]][indices/create]]; nested: IndexAlreadyExistsException[[pelias] already exists]; ' } { error: 'RemoteTransportException[[elasticsearch5.localdomain][inet[/10.0.2.100:9300]][indices/create]]; nested: IndexAlreadyExistsException[[pelias] already exists]; ',
status: 400 }
STDERR:
---- End output of node scripts/create_index.js ----
Ran node scripts/create_index.js returned 1

rethink schema types

We currently define elasticsearch types on a per-dataset basis, so things like GeoNames, OpenAddresses, and OSM all have an individual type (OSM has multiple, in fact: osmaddress, osmnode, and osmway). The majority of these are aliases for the poi mapping, which just defines a named point, which raises the question of why we bother with a type per dataset to begin with. I believe it'd be better to have just, say, three types - point (addresses/POIs), street, and boundary (administrative polygons) - for the following reasons:

  1. ease of use: this is really the biggest selling point. If someone wants to index a new dataset in Pelias, they either need to add a new type to the schema or use one of the existing ones. The problem here is that the user needs to learn about the different types, and we need to document that we have a type per dataset (and I don't think there's a clear reason why that is... we just do). People would then have to locally modify pelias-schema and recreate the index. Moreover, people won't get the different types right off the bat: what are openaddresses, geonames, admin1, etc? Then they have to go understand the POI mapping, and so on. It'd be much more intuitive for people to deal with types like point and street, because it should usually be painfully clear what kind of data you have, and those don't require much explanation.
  2. clarity: this was touched on in the previous point, but the type of data seems like a much stronger cause for separation into different elasticsearch types than its origin. It makes sense to separate administrative polygons from addresses/POIs from street linestrings. I don't think it makes as much sense to segregate them by data-source, which seems rather secondary.

Here are the downsides:

  1. we lose the ability to remove a specific dataset by just dropping a type. Kind of a non-issue, since we can just write an ES query to do the same thing and bundle that into a script.

  2. we'll have to:

    1. change record GUIDs: we currently generate pseudo-GUIDs in a variety of ways that differ by dataset (some just use an incremented integer). We'd need to write a separate Transform stream that creates uniform ids for all datasets that are guaranteed to be unique in their elasticsearch type. I'm envisioning this as a separate step of the imports:
    .pipe(peliasSuggester.pipeline)
    .pipe(peliasDocumentGuid()) // <--- record ID set here
    .pipe(peliasDbclient())

    It might just implement something like [esType, originDataset, localId++].join(":"), to generate IDs that look like point.openstreetmap.1, street.tiger.3, boundary.quattroshapes.59, ...

    2. update pelias-schema, changing the type names referenced by all of our importers.
    3. update our queries ( @hkrishna , how extensive would this be?)
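The ID-generating step described above can be sketched as follows. This is purely hypothetical code, not the actual Pelias implementation; note also that the issue text joins with ':' while its example IDs use '.', so '.' is used here to match the example IDs:

```javascript
// Hypothetical per-(type, dataset) document ID generator; names are illustrative.
function makeGuidGenerator() {
  const counters = {}; // one incrementing counter per (type, dataset) pair

  return function nextGuid(esType, originDataset) {
    const key = esType + '/' + originDataset;
    counters[key] = (counters[key] || 0) + 1;
    // IDs are unique within their elasticsearch type, e.g. point.openstreetmap.1
    return [esType, originDataset, counters[key]].join('.');
  };
}

const nextGuid = makeGuidGenerator();
console.log(nextGuid('point', 'openstreetmap')); // point.openstreetmap.1
console.log(nextGuid('point', 'openstreetmap')); // point.openstreetmap.2
console.log(nextGuid('street', 'tiger'));        // street.tiger.1
```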

Investigate disabling fielddata for name.* fields

My understanding from reading through the fielddata docs and from the AMA desk at Elasticon is that fielddata is only used for aggregations, sorting, and geo operations on geo fields.

Of those 3, as far as I know, we only do geo operations. However, fielddata is currently enabled on the name.* fields and in dev is currently using 18GB of elasticsearch memory.

If it's really the case that none of our queries use fielddata for anything, I suspect disabling it would at least save us memory, and probably improve performance.

review kstem for autocomplete

review kstem token filter for autocomplete analysis.

eg. 'walking' -> 'walk'
...but 'walki' => '??'

in cases where the stem is not strictly a prefix of the expansion, what happens then?

eg. 'peoples' -> 'person'
... but 'peop' -> '??'

Mismatch in synonym analysis between ngram and phrase analyzers

dev ticket to fix problems noted in pelias/pelias#211

the bug affects two 'classes' of tokens (street suffix synonyms and compass directional synonyms); in both cases it is triggered when the final token of the search text is a synonym. The result is that 0 results are returned:

/v1/autocomplete?text=world trade center      # last token 'center' has a synonym 'ctr'
/v1/autocomplete?text=hackney road            # last token 'road' has a synonym 'rd'
/v1/autocomplete?text=30 west                 # last token 'west' has a synonym 'w'

... all return 0 results

.. however it is not triggered when adding a comma and then specifying an 'admin' component:

/v1/autocomplete?text=30 west, new york

... returns >0 results

The reason this is happening is due to a 'mismatch' between how the 'ngrams' analyzer handles synonyms and how the 'phrase' analyzer handles them.

Since the query is split into 'finished' tokens and 'unfinished' tokens, these different 'types' of tokens get analyzed in different ways.

Eg. 'world trade center', we know that 'world' and 'trade' are finished (the user is done typing them) but the last term 'center' we are not yet sure if this is a partial word or a complete word.

So the first two tokens get sent to the 'phrase' analyzer which is super efficient while the last token has some tricky analysis applied to it.

Since we don't know whether it's complete yet, we have to check it against the ngrams index; however, we have a performance 'hack' in place which uses the phrase analyzer to produce a single token. So instead of using the ngrams analyzer to produce [ 'c', 'ce', 'cen', 'cent', 'center' ] we just produce [ 'center' ]; this results in a bit of a performance boost, as searching the other prefixes adds no value.

The issue is that using the peliasPhrase analyzer against an index created using peliasTwoEdgeGram analysis will not work properly, because the two handle synonyms differently: in the example above, the token created is [ 'ctr' ], not [ 'center' ] as expected. Elasticsearch can't find any docs with the ngram 'ctr', so no results are returned.
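The mismatch can be illustrated with a small sketch (not the real analyzers): the ngram index holds every prefix of center, but the phrase analyzer applies the synonym first and emits only ctr, which is not among those prefixes:

```javascript
// Build the edge ngrams that would be indexed for a token.
function edgeGrams(token) {
  const grams = [];
  for (let i = 1; i <= token.length; i++) grams.push(token.slice(0, i));
  return grams;
}

// Illustrative synonym mapping as applied by the phrase analyzer.
const SYNONYMS = { center: 'ctr', road: 'rd', west: 'w' };
function phraseToken(token) {
  return SYNONYMS[token] || token;
}

const indexed = edgeGrams('center'); // [ 'c', 'ce', 'cen', 'cent', 'cente', 'center' ]
const queried = phraseToken('center'); // 'ctr'
console.log(indexed.includes(queried)); // false -> 0 results
```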

in progress, more to come.

Connected to pelias/pelias#211

Use Docvalues when possible

geo_point fields and numeric fields (popularity and population, of type multiplier) can be changed to use doc values.

Doc values are only about 10–25% slower than in-memory fielddata, but they come with two major advantages:

  • They live on disk instead of in heap memory. This allows you to work with quantities of fielddata that would normally be too large to fit into memory. In fact, your heap space ($ES_HEAP_SIZE) can now be set to a smaller size, which improves the speed of garbage collection and, consequently, node stability.
  • Doc values are built at index time, not at search time. While in-memory fielddata has to be built on the fly at search time by uninverting the inverted index, doc values are prebuilt and much faster to initialize.

The trade-off is a larger index size and slightly slower fielddata access. Doc values are remarkably efficient, so for many queries you might not even notice the slightly slower speed. Combine that with faster garbage collections and improved initialization times and you may notice a net gain.
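As a sketch, enabling doc values is a per-field mapping change; the field names below come from our schema, but the field types shown are illustrative and the exact syntax should be checked against the Elasticsearch version in use:

```json
{
  "properties": {
    "center_point": { "type": "geo_point", "doc_values": true },
    "population":   { "type": "float",     "doc_values": true },
    "popularity":   { "type": "float",     "doc_values": true }
  }
}
```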

Create index leads to java.lang.ClassCastException exception in ES logs

Setup was:

  • installed latest develop branch
  • new box on EC2: ubuntu 14.04, elastic search 1.6.0, java 1.7
  • ran node scripts/create_index.js

Checking logs revealed following stack trace:
[2015-07-16 01:02:46,447][INFO ][gateway ] [Nemesis] recovered [1] indices into cluster_state
[2015-07-16 01:02:47,105][WARN ][index.warmer ] [Nemesis] [pelias][0] failed to warm-up global ordinals for [center_point]
java.lang.ClassCastException: org.elasticsearch.index.fielddata.plain.GeoPointDoubleArrayIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexFieldData$Global
at org.elasticsearch.search.SearchService$FieldDataWarmer$3.run(SearchService.java:953)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

MapperParsingException

Hello!

I tried to run the create_index.js as written in your documentation, but I had the following output:

[put mapping] pelias { message: 'MapperParsingException[mapping [_default_]]; nested: MapperParsingException[Unknown field [context]]; ' } { error: 'MapperParsingException[mapping [_default_]]; nested: MapperParsingException[Unknown field [context]]; ', status: 400 }

What have I done wrong ?

shingles

when using ngrams for search the ordering of exact matches is sometimes scored lower than those of less exact matches, eg.:

[screenshot: an exact match ranked below less exact matches]

this was addressed in the past with https://github.com/pelias/scripts/blob/master/scripts/exact_match.groovy but I never felt totally comfortable with an exact matching model, because it didn't allow any phrase slop or spelling mistakes.

there is a form of tokenization in elasticsearch called 'shingles' which allows us to index groups of adjacent word tokens and use them to score records with partial phrase matches higher.

the gist of it is this:

10 maple street -> [ '10', '10 maple', '10 maple street', 'maple', 'maple street', 'street' ]
105 maple street -> [ '105', '105 maple', '105 maple street', 'maple', 'maple street', 'street' ]
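The expansion above can be sketched in a few lines (sizes 1 to 3, matching the example; not the actual Lucene shingle filter):

```javascript
// Emit word shingles of size 1..maxSize for a phrase.
function shingles(text, maxSize = 3) {
  const words = text.split(/\s+/);
  const out = [];
  for (let i = 0; i < words.length; i++) {
    for (let size = 1; size <= maxSize && i + size <= words.length; size++) {
      out.push(words.slice(i, i + size).join(' '));
    }
  }
  return out;
}

console.log(shingles('10 maple street'));
// [ '10', '10 maple', '10 maple street', 'maple', 'maple street', 'street' ]
```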

attached below is a link to a playground script which combines 'shingles' with 'ngram' in order to fix the sorting in the screenshot above.

https://github.com/pelias/playground/blob/master/ngram/ngam_street_proximity.js

Storing names in elasticsearch

We currently have all place names stored as key/value pairs under the .name property.

A single entity can have many different names (different languages, common names etc).

This OSM record is a good example:
http://www.openstreetmap.org/way/238241022
http://pelias.mapzen.com/doc?id=osmway:238241022

name The White House
name:de Weißes Haus
name:fa کاخ سفید

The options for storing the data in Elasticsearch are:

one property per name on the document root.

document.name_default = "The White House"
document.name_fa = "کاخ سفید"

This approach would require each property to be explicitly defined at query time in order to tell elasticsearch which properties to search on. Is it possible to alias them?

an array of names:

document.name = [ "The White House", "Weißes Haus", "کاخ سفید" ]

This approach removes the name keys, which means we no longer have any way of telling the origin language and the 'default' name. It makes searching much easier.

a dictionary of names:

document.name = {
  default: "The White House",
  de: "Weißes Haus",
  fa: "کاخ سفید"
}

This is how we currently have it configured; it allows us to keep both the key and the value while also allowing us to add/remove elements from the schema. The disadvantage is that you cannot simply query document.name; you MUST query document.name.default or explicitly specify all fields (as above). It also means that the naming schema is not well documented and (besides name.default) can contain arbitrary keys, which makes it harder to query.

This ticket is opened in order to establish that the dictionary approach is the best option and/or to discuss alternate ways of storing names in order to make them easier to search.

Error while creating index

On running

node scripts/create_index.js

I am getting the following error. I have elasticsearch installed and running properly.

[put mapping] pelias { [Error: RemoteTransportException[[White Queen][inet[/192.168.1.49:9300]][indices:admin/create]]; nested: IndexCreationException[[pelias] failed to create index]; nested: ElasticsearchIllegalArgumentException[failed to find analyzer type [pelias-analysis] or tokenizer for [plugin]]; nested: NoClassSettingsException[Failed to load class setting [type] with value [pelias-analysis]]; nested: ClassNotFoundException[org.elasticsearch.index.analysis.pelias-analysis.Pelias-analysisAnalyzerProvider]; ]
message: 'RemoteTransportException[[White Queen][inet[/192.168.1.49:9300]][indices:admin/create]]; nested: IndexCreationException[[pelias] failed to create index]; nested: ElasticsearchIllegalArgumentException[failed to find analyzer type [pelias-analysis] or tokenizer for [plugin]]; nested: NoClassSettingsException[Failed to load class setting [type] with value [pelias-analysis]]; nested: ClassNotFoundException[org.elasticsearch.index.analysis.pelias-analysis.Pelias-analysisAnalyzerProvider]; ' }
{ error: 'RemoteTransportException[[White Queen][inet[/192.168.1.49:9300]][indices:admin/create]]; nested: IndexCreationException[[pelias] failed to create index]; nested: ElasticsearchIllegalArgumentException[failed to find analyzer type [pelias-analysis] or tokenizer for [plugin]]; nested: NoClassSettingsException[Failed to load class setting [type] with value [pelias-analysis]]; nested: ClassNotFoundException[org.elasticsearch.index.analysis.pelias-analysis.Pelias-analysisAnalyzerProvider]; ',
status: 400 }

allow override of index settings via custom json

Specifying index settings in PELIAS_CONFIG, e.g.:

{
  "index": {
    "number_of_replicas": "1",
    "number_of_shards": "30",
    "index.index_concurrency": "32"
  }
}

does not currently override the defaults.
