esdiff's Introduction

Diff for Elasticsearch

Warning: This is a work-in-progress. Things might break without warning.

The esdiff tool iterates over two indices in Elasticsearch 5.x, 6.x or 7.x and performs a diff between the documents in those indices.

It does so by scrolling over the indices. To allow for a stable sort order, it uses _id by default (_uid in ES 5.x).
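Why the stable sort order matters can be seen in a small sketch. The code below is not esdiff's actual implementation; it is a minimal merge over two in-memory slices, assuming both are already sorted ascending by ID using plain string comparison. The names Doc, Change and diffSorted are hypothetical, and the lowercase mode names other than "deleted" are assumed.

```go
package main

import (
	"fmt"
	"reflect"
)

// Doc is a hypothetical stand-in for one scrolled document.
type Doc struct {
	ID     string
	Source map[string]interface{}
}

// Change is one classified document ID.
type Change struct {
	Mode string // "created", "updated", "deleted" or "unchanged" (assumed names)
	ID   string
}

// diffSorted walks two slices that are both sorted ascending by ID
// (string comparison) and classifies every ID in a single pass.
func diffSorted(src, dst []Doc) []Change {
	var out []Change
	i, j := 0, 0
	for i < len(src) && j < len(dst) {
		switch {
		case src[i].ID < dst[j].ID:
			// Only in the source.
			out = append(out, Change{Mode: "deleted", ID: src[i].ID})
			i++
		case src[i].ID > dst[j].ID:
			// Only in the destination.
			out = append(out, Change{Mode: "created", ID: dst[j].ID})
			j++
		default:
			mode := "unchanged"
			if !reflect.DeepEqual(src[i].Source, dst[j].Source) {
				mode = "updated"
			}
			out = append(out, Change{Mode: mode, ID: src[i].ID})
			i++
			j++
		}
	}
	// Whatever is left on one side has no counterpart on the other.
	for ; i < len(src); i++ {
		out = append(out, Change{Mode: "deleted", ID: src[i].ID})
	}
	for ; j < len(dst); j++ {
		out = append(out, Change{Mode: "created", ID: dst[j].ID})
	}
	return out
}

func main() {
	src := []Doc{
		{ID: "1", Source: map[string]interface{}{"user": "olivere"}},
		{ID: "2", Source: map[string]interface{}{"user": "olivere"}},
		{ID: "3", Source: map[string]interface{}{"message": "piano"}},
	}
	dst := []Doc{
		{ID: "1", Source: map[string]interface{}{"user": "olivere"}},
		{ID: "3", Source: map[string]interface{}{"message": "guitar"}},
		{ID: "4", Source: map[string]interface{}{"user": "sandrae"}},
	}
	for _, c := range diffSorted(src, dst) {
		fmt.Printf("%s\t%s\n", c.Mode, c.ID)
	}
}
```

If either side arrives in a different order than the comparison expects, a merge like this reports false creations and deletions, which is why both scrolls need the same stable sort.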

You need Go 1.11 or later to compile. Note that the @latest form of go install requires Go 1.16 or later. Install with:

go install github.com/olivere/esdiff@latest

Example usage

First, we need to set up Elasticsearch clusters for testing (the docker-compose file starts one each for 5.x, 6.x and 7.x), then seed a few documents.

$ mkdir -p data

# Create an Elasticsearch 5.x cluster on http://localhost:19200
# Create an Elasticsearch 6.x cluster on http://localhost:29200
# Create an Elasticsearch 7.x cluster on http://localhost:39200

# Increase your docker memory limit (6.0GiB) in Docker App > Preferences > Advanced.
$ docker-compose up -d

Creating esdiff_elasticsearch5_1 ... done
Creating esdiff_elasticsearch6_1 ... done
Creating esdiff_elasticsearch7_1 ... done

# Check docker containers
$ docker-compose ps
         Name                        Command               State                 Ports
----------------------------------------------------------------------------------------------------
esdiff_elasticsearch5_1   /bin/bash bin/es-docker          Up      0.0.0.0:19200->9200/tcp, 9300/tcp
esdiff_elasticsearch6_1   /usr/local/bin/docker-entr ...   Up      0.0.0.0:29200->9200/tcp, 9300/tcp
esdiff_elasticsearch7_1   /usr/local/bin/docker-entr ...   Up      0.0.0.0:39200->9200/tcp, 9300/tcp

# Check docker container logs
$ docker-compose logs -f elasticsearch5
Attaching to esdiff_elasticsearch5_1
elasticsearch5_1  | [2019-07-02T14:17:33,351][WARN ][o.e.b.JNANatives         ] Unable to lock JVM Memory: error=12, reason=Cannot allocate memory
elasticsearch5_1  | [2019-07-02T14:17:33,355][WARN ][o.e.b.JNANatives         ] This can result in part of the JVM being swapped out.
elasticsearch5_1  | [2019-07-02T14:17:33,355][WARN ][o.e.b.JNANatives         ] Increase RLIMIT_MEMLOCK, soft limit: 83968000, hard limit: 83968000
elasticsearch5_1  | [2019-07-02T14:17:33,356][WARN ][o.e.b.JNANatives         ] These can be adjusted by modifying /etc/security/limits.conf, for example:
elasticsearch5_1  | # allow user 'elasticsearch' mlockall
........

# Add some documents
$ ./seed/01.sh

# Compile
$ go build

Let's make a simple diff:

Examples

Same cluster and same documents should return only unchanged documents:

$ ./esdiff -u=true 'http://localhost:19200/index01/tweet' 'http://localhost:19200/index01/tweet'
Unchanged       1
Unchanged       2
Unchanged       3

The following example will return a diff between indices in ES 5.x and ES 6.x:

$ ./esdiff -u=true 'http://localhost:19200/index01/tweet' 'http://localhost:29200/index01/_doc'
Unchanged       1
Deleted 2
Updated 3       {*diff.Document}.Source["message"]:
        -: "Playing the piano is fun as well"
        +: "Playing the guitar is fun as well"

Created 4       {*diff.Document}:
        -: (*diff.Document)(nil)
        +: &diff.Document{ID: "4", Source: map[string]interface {}{"message": "Climbed that mountain", "user": "sandrae"}}

The same comparison between ES 5.x and ES 7.x, again with different documents:

$ ./esdiff -u=true 'http://localhost:19200/index01/tweet' 'http://localhost:39200/index01/_doc'
Unchanged       1
Deleted 2
Updated 3       {*diff.Document}.Source["message"]:
        -: "Playing the piano is fun as well"
        +: "Playing the flute, oh boy"

Created 5       {*diff.Document}:
        -: (*diff.Document)(nil)
        +: &diff.Document{ID: "5", Source: map[string]interface {}{"message": "Ran that marathon", "user": "sandrae"}}

Output options

You can pass additional options to select which kinds of changes are printed. For example, to see unchanged documents but hide deleted ones, use -u=true -d=false:

$ ./esdiff -u=true -d=false 'http://localhost:19200/index01/tweet' 'http://localhost:29200/index01/_doc'
Unchanged       1
Updated 3       {*diff.Document}.Source["message"]:
        -: "Playing the piano is fun as well"
        +: "Playing the guitar is fun as well"

Created 4       {*diff.Document}:
        -: (*diff.Document)(nil)
        +: &diff.Document{ID: "4", Source: map[string]interface {}{"message": "Climbed that mountain", "user": "sandrae"}}

Formatting options

Pass -o=json to emit JSON instead. Combined with jq, jiq or other jq-related tools, this makes the output easy to filter and post-process.

$ ./esdiff -o=json 'http://localhost:29200/index01/_doc' 'http://localhost:39200/index01/_doc' | jq 'select(.mode | contains("deleted"))'
{
  "mode": "deleted",
  "_id": "4",
  "src": {
    "_id": "4",
    "_source": {
      "message": "Climbed that mountain",
      "user": "sandrae"
    }
  },
  "dst": null
}
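The JSON output can also be consumed from Go rather than jq. The sketch below assumes the output is a stream of JSON objects with the mode, _id, src and dst fields shown above. The names entry, countModes and countModesIn are hypothetical, and the sample input is made up for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// entry mirrors the fields visible in the JSON output above.
type entry struct {
	Mode string          `json:"mode"`
	ID   string          `json:"_id"`
	Src  json.RawMessage `json:"src"`
	Dst  json.RawMessage `json:"dst"`
}

// countModes decodes a stream of JSON objects and counts them by mode.
func countModes(r io.Reader) (map[string]int, error) {
	counts := make(map[string]int)
	dec := json.NewDecoder(r)
	for {
		var e entry
		err := dec.Decode(&e)
		if errors.Is(err, io.EOF) {
			return counts, nil
		}
		if err != nil {
			return nil, err
		}
		counts[e.Mode]++
	}
}

// countModesIn is a convenience wrapper for in-memory input.
func countModesIn(s string) (map[string]int, error) {
	return countModes(strings.NewReader(s))
}

func main() {
	// Made-up sample in the shape of the output shown above.
	sample := `{"mode":"deleted","_id":"4","src":{"_id":"4"},"dst":null}
{"mode":"deleted","_id":"9","src":{"_id":"9"},"dst":null}`

	counts, err := countModesIn(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("deleted:", counts["deleted"])
}
```

To process real output, pass os.Stdin to countModes and pipe the result of ./esdiff -o=json into the program.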

Filtering options

You can also pass a query to filter the source and/or the destination, using the -sf and -df args respectively:

$ ./esdiff -o=json -sf='{"term":{"user":"olivere"}}' 'http://localhost:29200/index01/_doc' 'http://localhost:19200/index01/_doc'
{"mode":"deleted","_id":"1","src":{"_id":"1","_source":{"message":"Welcome to Golang","user":"olivere"}},"dst":null}

All options

Use -h to display all options:

$ ./esdiff -h
General usage:

        esdiff [flags] <source-url> <destination-url>

General flags:
  -a    Print added docs (default true)
  -c    Print changed docs (default true)
  -d    Print deleted docs (default true)
  -df string
        Raw query for filtering the destination, e.g. {"term":{"name.keyword":"Oliver"}}
  -dsort string
        Field to sort the destination, e.g. "id" or "-id" (prepend with - for descending)
  -exclude string
        Raw source filter for excluding certain fields from the source, e.g. "hash_value,sub.*"
  -include string
        Raw source filter for including certain fields from the source, e.g. "obj.*"
  -o string
        Output format, e.g. json
  -sf string
        Raw query for filtering the source, e.g. {"term":{"user":"olivere"}}
  -size int
        Batch size (default 100)
  -ssort string
        Field to sort the source, e.g. "id" or "-id" (prepend with - for descending)
  -u    Print unchanged docs
  -replace-with string
        Replace the id in the document with the unique field you need from the source,e.g. "unique_key"

License

MIT. See LICENSE.

esdiff's Issues

Can I get push permission?

I added ES 7 support and fixed a bug when running the ES 5 and ES 6 Docker containers.

But I cannot push to this repository.

remote: Permission to olivere/esdiff.git denied to suhyunjeon.
fatal: unable to access 'https://github.com/olivere/esdiff.git/': The requested URL returned error: 403

catch bug ^_^

I think 'case doc, ok := <-dstCh:' should probably read from srcCh instead.

if srcDoc != nil && dstDoc == nil {
  diffCh <- Diff{Mode: Deleted, Src: srcDoc}
  for {
    select {
    case doc, ok := <-dstCh:
      if !ok {
        return
      }
      diffCh <- Diff{Mode: Deleted, Src: doc}
    case <-ctx.Done():
      errCh <- ctx.Err()
      return
    }
  }
}
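A sketch of what the report seems to suggest: once the destination is exhausted, the remaining source documents are drained from srcCh and reported as Deleted. This is not the project's actual fix. The types Mode, Document and Diff are simplified stand-ins, and drainDeleted and runDrain are hypothetical helpers. The sketch returns an error instead of sending on errCh, purely to stay self-contained.

```go
package main

import (
	"context"
	"fmt"
)

// Simplified, hypothetical stand-ins for the types used in the snippet above.
type Mode string

const Deleted Mode = "deleted"

type Document struct{ ID string }

type Diff struct {
	Mode Mode
	Src  *Document
	Dst  *Document
}

// drainDeleted reports every document still waiting on srcCh as Deleted.
// It returns nil once srcCh is closed, or the context error on cancellation.
func drainDeleted(ctx context.Context, srcCh <-chan *Document, diffCh chan<- Diff) error {
	for {
		select {
		case doc, ok := <-srcCh: // read from the source, not the destination
			if !ok {
				return nil
			}
			select {
			case diffCh <- Diff{Mode: Deleted, Src: doc}:
			case <-ctx.Done():
				return ctx.Err()
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// runDrain feeds the given documents through drainDeleted and collects the diffs.
func runDrain(remaining []*Document) []Diff {
	srcCh := make(chan *Document, len(remaining))
	for _, d := range remaining {
		srcCh <- d
	}
	close(srcCh)

	diffCh := make(chan Diff, len(remaining))
	if err := drainDeleted(context.Background(), srcCh, diffCh); err != nil {
		return nil
	}
	close(diffCh)

	var out []Diff
	for d := range diffCh {
		out = append(out, d)
	}
	return out
}

func main() {
	for _, d := range runDrain([]*Document{{ID: "7"}, {ID: "8"}}) {
		fmt.Println("Deleted", d.Src.ID)
	}
}
```

Reading from dstCh in this branch would wait on a channel that has nothing left to deliver, so the leftover source documents would never be reported.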

Inconsistent results with larger IDs

We've been using esdiff quite a bit and have found that it sometimes returns inconsistent results; that is, running it multiple times against the same two indexes produces different results.
It also seems to be missing some new documents entirely.

I've come up with the following test case:

#!/bin/sh
curl -H 'Content-Type: application/json' -XDELETE 'localhost:19201/oldindex'
curl -H 'Content-Type: application/json' -XDELETE 'localhost:19201/newindex'
curl -H 'Content-Type: application/json' -XPUT 'localhost:19201/oldindex/event/239473748' -d '{"id":"239473748","name":"Same Document"}'

curl -H 'Content-Type: application/json' -XPUT 'localhost:19201/newindex/event/239473748' -d '{"id":"239473748","name":"Same Document"}'
curl -H 'Content-Type: application/json' -XPUT 'localhost:19201/newindex/event/34' -d '{"id":"34","name":"New Document"}'
curl -H 'Content-Type: application/json' -XPUT 'localhost:19201/newindex/event/32' -d '{"id":"32","name":"New Document 2"}'

This runs against two indexes on ES6, but I discovered the behaviour on ES5.
I then run the following command repeatedly:

esdiff -a=true -c=true -d=true -u=true http://localhost:19201/oldindex/event http://localhost:19201/newindex/event

And get mostly:

Unchanged       239473748

Sometimes:

Rarely:

Unchanged       239473748
Created 32      {*diff.Document}:
        -: (*diff.Document)(nil)
        +: &diff.Document{ID: "32", Source: map[string]interface {}{"id": "32", "name": "New Document 2"}}

I think this has something to do with the id field: if I use only small numbers it works fine, but once some of the documents have larger IDs the problem starts.

Please let me know if you need any more information.
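The report does not identify a cause. One thing that may be worth ruling out when IDs are numeric strings is ordering: the IDs from the test case sort differently as strings than as numbers. The sketch below only demonstrates that difference. Whether it plays any role in esdiff's behaviour is an unverified hypothesis, and both helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// sortedAsStrings returns a copy of ids in lexicographic order.
func sortedAsStrings(ids []string) []string {
	out := append([]string(nil), ids...)
	sort.Strings(out)
	return out
}

// sortedAsNumbers returns a copy of ids ordered by their integer value.
func sortedAsNumbers(ids []string) ([]string, error) {
	type pair struct {
		n int64
		s string
	}
	pairs := make([]pair, 0, len(ids))
	for _, id := range ids {
		n, err := strconv.ParseInt(id, 10, 64)
		if err != nil {
			return nil, err
		}
		pairs = append(pairs, pair{n: n, s: id})
	}
	sort.Slice(pairs, func(i, j int) bool { return pairs[i].n < pairs[j].n })

	out := make([]string, len(pairs))
	for i, p := range pairs {
		out[i] = p.s
	}
	return out, nil
}

func main() {
	ids := []string{"239473748", "34", "32"} // IDs from the test case above

	fmt.Println(sortedAsStrings(ids)) // [239473748 32 34]

	nums, err := sortedAsNumbers(ids)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(nums) // [32 34 239473748]
}
```

With only small IDs of equal length the two orders coincide, which would be consistent with the observation that small numbers work fine.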

Compilation error

Hello,

I'm trying to compile it on Ubuntu 19.10. I've already upgraded Go to 1.14 using the official PPA.

However, when I try to compile I get the following errors:

/usr/local/go/src/net/http/h2_bundle.go:45:2: package golang_org/x/net/http2/hpack is not in GOROOT (/usr/local/go/src/golang_org/x/net/http2/hpack)
/usr/local/go/src/net/http/h2_bundle.go:46:2: package golang_org/x/net/lex/httplex is not in GOROOT (/usr/local/go/src/golang_org/x/net/lex/httplex)

Doc IDs only?

Hi OliverE, is there any way to compare only document IDs, without diffing the contents of the docs? Ideally this would speed up the operation.

Edit: I wrote a quick script with your main library, and it was faster than I expected. I used the scroll service. Thanks for the great libraries.

invalid indirect of hit.Source (type json.RawMessage)

  • go version
go version go1.12.6 linux/amd64
  • errors
$  go install github.com/olivere/esdiff
# github.com/olivere/esdiff/elastic/v6
../go/esdiff/src/github.com/olivere/esdiff/elastic/v6/client.go:137:27: invalid indirect of hit.Source (type json.RawMessage)
