
moshe / elasticsearch_loader


A tool for batch loading data files (json, parquet, csv, tsv) into ElasticSearch

License: MIT License

Python 98.89% Dockerfile 1.11%
csv elasticsearch elasticsearch-loader json logstash parquet python

elasticsearch_loader's People

Contributors

ayaka-soeda, dependabot-preview[bot], digenis, kmkamonseki, lxj616, moshe, nestdream, rtim0, skyberd, stickler-ci[bot]


elasticsearch_loader's Issues

Does not support Python 3

Traceback (most recent call last):
  File "/usr/bin/elasticsearch_loader", line 7, in <module>
    from elasticsearch_loader import cli
  File "/usr/lib/python3.5/site-packages/elasticsearch_loader.py", line 7, in <module>
    from click_stream import Stream
  File "/usr/lib/python3.5/site-packages/click_stream.py", line 3, in <module>
    from urlparse import urlparse
ImportError: No module named 'urlparse'
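
For reference, urlparse was renamed to urllib.parse in Python 3; a minimal compatibility shim for the failing import (a sketch, not the dependency's actual fix) looks like:

# Sketch of a Python 2/3 compatible version of the failing import in click_stream.py.
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2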

Document how to invoke elasticsearch_loader as a Python Module

Hi, I'd like to integrate elasticsearch_loader into my Python project and invoke it as a module, but going by your docs I'm not sure how to do that. I can invoke it through subprocess or a shell call, but I'd rather trigger it directly as a Python module. This seems like it should be possible. An example or two in the README.md would be fantastic.
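
Until this is documented, one way to drive the click-based CLI from inside Python is click's runner; a minimal sketch (the `cli` entry point is taken from the tracebacks in other issues, everything else is illustrative):

# Sketch: invoke the click CLI in-process instead of via subprocess.
from click.testing import CliRunner
from elasticsearch_loader import cli

runner = CliRunner()
# Equivalent to: elasticsearch_loader --index people csv people.csv
result = runner.invoke(cli, ["--index", "people", "csv", "people.csv"])
print(result.exit_code, result.output)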

Can we skip if a document already exists?

I was loading a 128 GB file :) It was all going well, almost 74% done, when it failed.
As far as I can tell, the default behaviour is to overwrite. Is there any way I can restart the load and skip the documents that already exist?
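
elasticsearch_loader itself does not seem to expose this, but the bulk API can be told to create-only so that documents whose _id already exists are skipped; a rough sketch with the official Python client (index name, id field and data source are made up for illustration, and it assumes each document carries a stable id, as the tool's --id-field option suggests):

# Sketch: re-run a load but skip documents that already exist.
# "_op_type": "create" makes Elasticsearch reject (409) docs whose _id is already
# indexed, and raise_on_error=False lets the bulk helper carry on past those conflicts.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
docs = [{"id": "1", "name": "example"}]  # stand-in for the real data source

def actions(records, index):
    for doc in records:
        yield {
            "_op_type": "create",  # only succeeds if this _id is not indexed yet
            "_index": index,
            "_id": doc["id"],      # assumes a stable id field per document
            "_source": doc,
        }

ok, errors = helpers.bulk(es, actions(docs, "my-index"), raise_on_error=False)
print(ok, len(errors))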

Importing numeric data

Hi,
I'm importing my CSV files into Elasticsearch to make visualizations in Kibana.
As far as I can tell, the script treats every field in the CSV as a string, so the field type is always text.
Is it possible to add a per-CSV-file configuration with field-to-type mappings, or could you implement autodetection of field types for some generic types such as int, float, text and timestamp?
Alternatively, instead of importing the data into Elasticsearch, the data could be dumped as a JSON file (for manual changes); the import with the "json" command could then be done afterwards.

Regards,
cybcon
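
For reference, one way to get numeric and date fields today is to create the index with an explicit mapping via --index-settings-file; Elasticsearch will then coerce numeric strings from the CSV into the mapped types. A sketch of such a settings file, with made-up field names (on clusters that still use mapping types, the properties may need to be nested under the type name):

{
  "settings": {},
  "mappings": {
    "properties": {
      "price": {"type": "float"},
      "quantity": {"type": "integer"},
      "created_at": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
      "description": {"type": "text"}
    }
  }
}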

No support for Elasticsearch 5.0.0 GA

No support for Elasticsearch 5.0.0 GA, see error below

"Uploading [###################################-] 98% 00:00:052016-10-31 00:09:51.475786 WARN Chunk 101 got exception (TransportError(400, u'illegal_argument_exception', u'request [/_bulk] contains unrecognized parameter: [consistency]')) while processing"

usage

Is it possible to import elasticsearch_loader like other libraries, without installing it as a command-line tool, and call it from inside Python? I currently use os.system(command), but I want to use it inside an executable file.

thanks

Get keys from header row

Hello! In the case of csv files, can you get the keys from the header row?

My goal is to load a csv file like this:

id, first_name, last_name
1, thomas, kane
2, john, do

And see something like this loaded:

{
        "_index" : "people",
        "_type" : "person",
        "_id" : "IuOkjG8Bny05PI-SGOq0",
        "_score" : 1.0,
        "_source" : {
                "id" : "1",
                "first_name" : "Thomas",
                "last_name" : "Kane"
        }
}

Is this possible?

Thanks!
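
For what it's worth, the csv subcommand already appears to take the keys from the header row: the tracebacks in other issues show it wrapping each file in csv.DictReader, which uses the first row as field names. A minimal stand-alone sketch of that behaviour (file name made up; note the values stay strings, e.g. "1" rather than 1):

# Sketch: this mirrors what the csv subcommand does with the header row.
# csv.DictReader treats the first row as field names, so each row becomes
# a dict like {"id": "1", "first_name": "thomas", "last_name": "kane"}.
import csv

with open("people.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)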

Getting `Specifying types in bulk requests is deprecated.` message

Any thoughts on why I am getting this message? Is this an error?

$ elasticsearch_loader --index example4   csv example_file.csv 
{'index': 'example4', 'bulk_size': 500, 'es_host': ('http://localhost:9200',), 'verify_certs': False, 'use_ssl': False, 'ca_certs': None, 'http_auth': None, 'delete': False, 'update': False, 'progress': False, 'type': '_doc', 'id_field': None, 'as_child': False, 'with_retry': False, 'index_settings_file': None, 'timeout': 10.0, 'encoding': 'utf-8', 'keys': [], 'es_conn': <Elasticsearch([{'host': 'localhost', 'port': 9200}])>}
/Users/danielk/opt/anaconda3/lib/python3.7/site-packages/elasticsearch/helpers/actions.py:122: ElasticsearchDeprecationWarning: [types removal] Specifying types in bulk requests is deprecated.
  resp = client.bulk("\n".join(bulk_actions) + "\n", *args, **kwargs)

Error trying to load JSON input to ES

Hey! I'm trying to load some data into ES. When I try with POST, the document is added:

POST titles/_doc/
{
        "CourseId":35,
        "UnitId":12390,
        "title":"Some titles",
        "id":"16069",
        "CourseName":"Some names",
        "FieldId":8,
        "field":"Some field"
  
}

I'm trying with this command elasticsearch_loader --index incidents --type incident json titles.json

I got this error

$ elasticsearch_loader --index incidents --type incident json titles.json 
Traceback (most recent call last):
  File "/usr/local/bin/elasticsearch_loader", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/elasticsearch_loader/__init__.py", line 159, in main
    cli()
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/elasticsearch_loader/__init__.py", line 135, in _json
    lines = chain(*(json.load(x) for x in files))
  File "/usr/local/lib/python3.6/site-packages/elasticsearch_loader/__init__.py", line 135, in <genexpr>
    lines = chain(*(json.load(x) for x in files))
  File "/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 8 column 25 (char 222)

Total data is 170123

And my titles.json file looks like this:

{...},
{...},
      {
        "CourseId":35,
        "UnitId":12390,
        "title":"xxx",
        "id":"16069",
        "CourseName":"xxx",
        "FieldId":8,
        "field":"xxx"},
      {
        "CourseId":48,
        "UnitId":396,
        "title":"xxx",
        "id":"16070",
        "CourseName":"xxx",
        "FieldId":29,
        "field":"xxx"},
{...},
{...}

Any help?
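
The "Extra data" error means json.load stopped after the first value: titles.json is a series of comma-separated objects rather than a single JSON document. One way forward (a sketch, assuming the objects really are comma-separated as the excerpt suggests) is to wrap them in a top-level array and load the repaired file instead:

# Sketch: wrap a file of comma-separated JSON objects into one JSON array
# so that json.load() (which the json subcommand uses) can parse it.
import json

with open("titles.json") as src:
    text = src.read().strip().rstrip(",")

docs = json.loads("[" + text + "]")  # now a list of objects

with open("titles_fixed.json", "w") as dst:
    json.dump(docs, dst)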

Loading parquet files from s3 into Elasticsearch

Hi, I was taking a look at the library code and noticed that esl-s3 only supports JSON. Is there a PR I can work on that adds other formats too? In my use case it's Parquet, and I know you have a Parquet loader. I'm wondering if I could reuse that module with S3.

Thanks in advance.

elasticsearch_loader in ubuntu: MemoryError

Hi, I loaded the same file from my Mac into AWS ES and it works, but the same command on Ubuntu gives this error:

'Connecting to %s using SSL with verify_certs=False is insecure.' % host)
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/elasticsearch_loader", line 11, in <module>
    load_entry_point('elasticsearch-loader==0.2.6', 'console_scripts', 'elasticsearch_loader')()
  File "/home/ubuntu/.local/lib/python3.5/site-packages/elasticsearch_loader/__init__.py", line 160, in main
    cli()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/elasticsearch_loader/__init__.py", line 136, in _json
    lines = chain(*(json.load(x) for x in files))
  File "/home/ubuntu/.local/lib/python3.5/site-packages/elasticsearch_loader/__init__.py", line 136, in <genexpr>
    lines = chain(*(json.load(x) for x in files))
  File "/usr/lib/python3.5/json/__init__.py", line 265, in load
    return loads(fp.read(),
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError
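
For reference, the traceback shows the json command calling json.load, which reads the whole file into memory at once; on the smaller Ubuntu box that is what runs out of memory. One workaround (a sketch, not a feature of the tool) is to split the big JSON array into smaller files on the machine that can hold it, then load the parts one by one:

# Sketch: split one huge JSON array file into smaller part files.
# Run this on a machine with enough RAM (the Mac in this case), then load each part separately.
import json

CHUNK = 50000  # documents per part file; tune to the RAM of the loading machine

with open("big.json") as f:
    docs = json.load(f)

for i in range(0, len(docs), CHUNK):
    with open("part_{:05d}.json".format(i // CHUNK), "w") as out:
        json.dump(docs[i:i + CHUNK], out)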

Parquet groupings

When loading a directory of Parquet files (an option to pass a directory without listing each file would be great :)), I noticed that the time before the ingest starts gets progressively longer the more files there are. On a smaller machine it would just fail without any warning or error; nothing would happen. To troubleshoot this I started increasing the number of files and found that machine size mattered: I could do all 18 files on a larger machine (128 GB of RAM, 24 cores) but not on a 32 GB RAM, 8-core machine.

Would it be possible to process Parquet files in batches as well, so that the required resources would be smaller and the work serialized? This sort of corresponds to #12, as it would require some sort of "append".

Unable to run on mac

I was able to install this on a Mac, but I'm unable to run it. It returns the error below. Any thoughts? Is this platform dependent? Is it only tested on Linux?

Traceback (most recent call last):
  File "/Users/jenkins/Library/Python/3.6/bin/elasticsearch_loader", line 11, in <module>
    load_entry_point('elasticsearch-loader==0.2.2', 'console_scripts', 'elasticsearch_loader')()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkg_resources/__init__.py", line 565, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2631, in load_entry_point
    return ep.load()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2291, in load
    return self.resolve()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2297, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/Users/jenkins/Library/Python/3.6/lib/python/site-packages/elasticsearch_loader/__init__.py", line 8, in <module>
    from click_conf import conf
ModuleNotFoundError: No module named 'click_conf'

jsonl not working

I am trying to load a collection of json files.

Here is the content of my jsonl file:

{"text":"When did the Scholastic Magazine of Notre dame begin publishing?","pIdx":0,"qIdx":0}
{"text":"How often is Notre Dame's the Juggler published?","pIdx":0,"qIdx":1}

and here is the error I am getting:

huntsman-ve500-0792:other daniel$ elasticsearch_loader --index questions --type train json squad_train_questions2.jsonl 
Traceback (most recent call last):
  File "/Users/daniel/miniconda3/bin/elasticsearch_loader", line 11, in <module>
    sys.exit(main())
  File "/Users/daniel/miniconda3/lib/python3.7/site-packages/elasticsearch_loader/__init__.py", line 160, in main
    cli()
  File "/Users/daniel/miniconda3/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/daniel/miniconda3/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/daniel/miniconda3/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/daniel/miniconda3/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/daniel/miniconda3/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/daniel/miniconda3/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/daniel/miniconda3/lib/python3.7/site-packages/elasticsearch_loader/__init__.py", line 136, in _json
    lines = chain(*(json.load(x) for x in files))
  File "/Users/daniel/miniconda3/lib/python3.7/site-packages/elasticsearch_loader/__init__.py", line 136, in <genexpr>
    lines = chain(*(json.load(x) for x in files))
  File "/Users/daniel/miniconda3/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/Users/daniel/miniconda3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/Users/daniel/miniconda3/lib/python3.7/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 94)
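
The json subcommand hands the whole file to json.load, so a JSON Lines file (one object per line) fails on line 2 with "Extra data". One workaround (a sketch, not a built-in option of the tool) is to convert the .jsonl into a regular JSON array first:

# Sketch: convert JSON Lines (one object per line) into a single JSON array
# that the json subcommand can parse with json.load().
import json

with open("squad_train_questions2.jsonl") as src:
    docs = [json.loads(line) for line in src if line.strip()]

with open("squad_train_questions2.json", "w") as dst:
    json.dump(docs, dst)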

Specific Keys of Json

I'd like to load only specific keys of a JSON file into ES. Would it be possible to support this?
Example:
json --keys a,b.c,d [json_files]

Add a --no-type option?

As expected, the TYPE field is no longer necessary in ES 7.x. When importing into an ES 7.x cluster I'm getting the error:

'reason': 'The [default] mapping cannot be updated on index [blah]: defaults mappings are not useful anymore now that indices can have at most one type.', 'type': 'illegal_argument_exception'

When I try to designate the index as "", it throws the error:

type is missing

Can we add a '--no-type' flag to allow for ES7.X until 6.X is EOL?

Thanks for this tool!

Unicode issues loading from utf-8 csv

I see

WARN Chunk 0 got exception ('ascii' codec can't decode byte 0xc3 in position 1548: ordinal not in range(128)) while processing

when loading a utf-8 encoded csv file

List data being indexed as Text/Keyword

When I try to index an array of Text or Keyword, the data is indexed as a Text/keyword, and not as an array. Check the csv sample and ES index below:

CSV

title year director stars genres
Inception 2010 Christopher Nolan [ "Leonardo DiCaprio", "Joseph Gordon-Levitt", "Ellen Page", "Ken Watanabe"] ["Action", "Adventure", "Sci-Fi" ]

Index

{
  "movies": {
    "aliases": {},
    "mappings": {
      "_doc": {
        "properties": {
          "director": {
            "type": "text"
          },
          "genres": {
            "type": "text"
          },
          "stars": {
            "type": "text"
          },
          "title": {
            "type": "text"
          },
          "year": {
            "type": "integer"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1543666909596",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "KRE71VReQWeBRfKaf_DCyw",
        "version": {
          "created": "6020499"
        },
        "provided_name": "movies"
      }
    }
  }
}

Expected indexed document

      {
        "_index": "movies",
        "_type": "_doc",
        "_id": "8",
        "_score": 1,
        "_source": {
          "title": "Inception",
          "year": 2010,
          "director": "Christopher Nolan",
          "stars": [
            "Leonardo DiCaprio",
            "Joseph Gordon-Levitt",
            "Ellen Page",
            "Ken Watanabe"
          ],
          "genres": [
            "Action",
            "Adventure",
            "Sci-Fi"
          ]
        }
      }

Actual document indexed

      {
        "_index": "movies",
        "_type": "_doc",
        "_id": "jaVdwWkB3XGrwhp_3RfU",
        "_score": 1,
        "_source": {
          "director": "Christopher Nolan",
          "genres": """
[
  "Action",
  "Adventure",
  "Sci-Fi"
]
""",
          "year": "2010",
          "stars": """
[
  "Leonardo DiCaprio",
  "Joseph Gordon-Levitt",
  "Ellen Page",
  "Ken Watanabe"
]
""",
          "title": "Inception"
        }
      }
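
Since the csv reader hands every cell to Elasticsearch as a plain string, the bracketed cells arrive as literal text. One workaround (a sketch outside the tool itself, with a made-up file name) is to pre-process the rows and decode any cell that looks like a JSON array before indexing:

# Sketch: decode JSON-array cells (e.g. '["Action", "Adventure"]') into real lists
# before sending the rows to Elasticsearch, so they are indexed as arrays.
import csv
import json

def parse_cell(value):
    value = value.strip()
    if value.startswith("[") and value.endswith("]"):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            pass  # leave malformed cells as plain strings
    return value

with open("movies.csv", newline="") as f:
    rows = [{k: parse_cell(v) for k, v in row.items()} for row in csv.DictReader(f)]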

Issue with Date time fields

Hey,

Excellent tool, it saves us a ton of time. However, if we import a CSV with lots of date fields, somehow everything is loaded as a string. I'm not sure if this is the tool or the Elasticsearch settings!

Thanks a ton anyways!

Regards,
Vivek

Is "append" to index an option

I am looking through the docs, code and test usage, and perhaps I am missing something, but it appears the options are: A. the index does not exist, and in a single run of elasticsearch_loader you can load an entire index; or B. the index does exist and you specified --delete, so the existing index will be deleted. When you run elasticsearch_loader on an index that already exists but do not specify --delete, it fails with the error: elasticsearch.exceptions.RequestError: TransportError(400, u'index_already_exists_exception')

Thus my request, if possible, is an "append" to an existing index.

Rejecting mapping update to [audits] as the final mapping would have more than 1 type

Command:

elasticsearch_loader --index-settings-file audit_mapping.json --index audits --http-auth elastic:PASSWORD --type audit csv audit.csv

MAPPING FILE (audit_mapping.json):

{
  "settings" : {
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "module": {
        "type": "keyword"
      },
      "action": {
        "type": "keyword"
      },
      "occured_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}

CSV file (audit.csv):

"id";"module";"action";"description";"level";"occured_at";"created_user";"createdAt";"updatedAt"
"001677f8-6e03-4ea2-bc5c-442e19477189";"firewall";"activated";"Local activated on lan6";0;"2019-03-08 13:24:11";"";"2019-03-08 13:24:11";"2019-03-08 13:24:11"
"001e28e3-8f3a-40a1-9d74-2f1720f9612d";"analysis";"initialising";"Ready.";0;"2019-03-20 16:15:46";"";"2019-03-20 16:15:46";"2019-03-20 16:15:46"
"0022d1da-398e-4026-ac58-899201fb36e9";"AuthorisationService";"Login";"Logging in ggreen user";0;"2019-03-19 14:52:54";"";"2019-03-19 14:52:54";"2019-03-19 14:52:54"
"0044084a-a749-4aa6-af05-4deba6a38432";"UserService";"UnBlock";"UnBlock user: ppurple";0;"2019-03-19 15:37:31";"ad223ccd-b6f9-4bd8-868d-4fae2e95d9a2";"2019-03-19 15:37:31";"2019-03-19 15:37:31"

Errors:

elasticsearch_loader --index-settings-file integra_audit_mapping.json --index audits --http-auth elastic:rHwdTwdvfGxKlhyApO34 --type audit csv integra_audit2.csv
{'index_settings_file': <_io.BufferedReader name='integra_audit_mapping.json'>, 'index': 'audits', 'http_auth': 'elastic:rHwdTwdvfGxKlhyApO34', 'type': 'audit', 'bulk_size': 500, 'es_host': ('http://localhost:9200',), 'verify_certs': False, 'use_ssl': False, 'ca_certs': None, 'delete': False, 'update': False, 'progress': False, 'id_field': None, 'as_child': False, 'with_retry': False, 'timeout': 10.0, 'encoding': 'utf-8', 'keys': [], 'es_conn': <Elasticsearch([{'host': 'localhost', 'port': 9200}])>}
2020-01-30 15:07:40.873155 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2020-01-30 15:07:40.873255 WARN Chunk 0 got exception (('4 document(s) failed to index.', [{'index': {'_index': 'audits', '_type': 'audit', '_id': 'lXT-9m8BOz4WhPcWVe2O', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Rejecting mapping update to [audits] as the final mapping would have more than 1 type: [_doc, audit]'}, 'data': OrderedDict([('id;"module";"action";"description";"level";"occured_at";"created_user";"createdAt";"updatedAt"', '001677f8-6e03-4ea2-bc5c-442e19477189;"firewall";"activated";"Local activated on lan6";0;"2019-03-08 13:24:11";"";"2019-03-08 13:24:11";"2019-03-08 13:24:11"')])}}, {'index': {'_index': 'audits', '_type': 'audit', '_id': 'lnT-9m8BOz4WhPcWVe2O', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Rejecting mapping update to [audits] as the final mapping would have more than 1 type: [_doc, audit]'}, 'data': OrderedDict([('id;"module";"action";"description";"level";"occured_at";"created_user";"createdAt";"updatedAt"', '001e28e3-8f3a-40a1-9d74-2f1720f9612d;"analysis";"initialising";"Ready.";0;"2019-03-20 16:15:46";"";"2019-03-20 16:15:46";"2019-03-20 16:15:46"')])}}, {'index': {'_index': 'audits', '_type': 'audit', '_id': 'l3T-9m8BOz4WhPcWVe2O', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Rejecting mapping update to [audits] as the final mapping would have more than 1 type: [_doc, audit]'}, 'data': OrderedDict([('id;"module";"action";"description";"level";"occured_at";"created_user";"createdAt";"updatedAt"', '0022d1da-398e-4026-ac58-899201fb36e9;"AuthorisationService";"Login";"Logging in ggreen user";0;"2019-03-19 14:52:54";"";"2019-03-19 14:52:54";"2019-03-19 14:52:54"')])}}, {'index': {'_index': 'audits', '_type': 'audit', '_id': 'mHT-9m8BOz4WhPcWVe2O', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Rejecting mapping update to [audits] as the final mapping would have more than 1 type: [_doc, audit]'}, 'data': OrderedDict([('id;"module";"action";"description";"level";"occured_at";"created_user";"createdAt";"updatedAt"', '0044084a-a749-4aa6-af05-4deba6a38432;"UserService";"UnBlock";"UnBlock user: ppurple";0;"2019-03-19 15:37:31";"ad223ccd-b6f9-4bd8-868d-4fae2e95d9a2";"2019-03-19 15:37:31";"2019-03-19 15:37:31"')])}}])) while processing
Traceback (most recent call last):
  File "/home/sharry/.local/bin/elasticsearch_loader", line 11, in <module>
    load_entry_point('elasticsearch-loader==0.6.0', 'console_scripts', 'elasticsearch_loader')()
  File "/home/sharry/.local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/sharry/.local/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/sharry/.local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sharry/.local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sharry/.local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/sharry/.local/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/sharry/.local/lib/python3.7/site-packages/elasticsearch_loader/__init__.py", line 134, in _csv
    load(lines, ctx.obj)
  File "/home/sharry/.local/lib/python3.7/site-packages/elasticsearch_loader/__init__.py", line 53, in load
    single_bulk_to_es(bulk, config, config['with_retry'])
  File "/home/sharry/.local/lib/python3.7/site-packages/elasticsearch_loader/__init__.py", line 37, in single_bulk_to_es
    raise e
  File "/home/sharry/.local/lib/python3.7/site-packages/elasticsearch_loader/__init__.py", line 28, in single_bulk_to_es
    helpers.bulk(config['es_conn'], bulk, chunk_size=config['bulk_size'])
  File "/home/sharry/.local/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 304, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/home/sharry/.local/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 234, in streaming_bulk
    **kwargs
  File "/home/sharry/.local/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 162, in _process_bulk_chunk
    raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
elasticsearch.helpers.errors.BulkIndexError: ('4 document(s) failed to index.', [{'index': {'_index': 'audits', '_type': 'audit', '_id': 'lXT-9m8BOz4WhPcWVe2O', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Rejecting mapping update to [audits] as the final mapping would have more than 1 type: [_doc, audit]'}, 'data': OrderedDict([('id;"module";"action";"description";"level";"occured_at";"created_user";"createdAt";"updatedAt"', '001677f8-6e03-4ea2-bc5c-442e19477189;"firewall";"activated";"Local activated on lan6";0;"2019-03-08 13:24:11";"";"2019-03-08 13:24:11";"2019-03-08 13:24:11"')])}}, {'index': {'_index': 'audits', '_type': 'audit', '_id': 'lnT-9m8BOz4WhPcWVe2O', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Rejecting mapping update to [audits] as the final mapping would have more than 1 type: [_doc, audit]'}, 'data': OrderedDict([('id;"module";"action";"description";"level";"occured_at";"created_user";"createdAt";"updatedAt"', '001e28e3-8f3a-40a1-9d74-2f1720f9612d;"analysis";"initialising";"Ready.";0;"2019-03-20 16:15:46";"";"2019-03-20 16:15:46";"2019-03-20 16:15:46"')])}}, {'index': {'_index': 'audits', '_type': 'audit', '_id': 'l3T-9m8BOz4WhPcWVe2O', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Rejecting mapping update to [audits] as the final mapping would have more than 1 type: [_doc, audit]'}, 'data': OrderedDict([('id;"module";"action";"description";"level";"occured_at";"created_user";"createdAt";"updatedAt"', '0022d1da-398e-4026-ac58-899201fb36e9;"AuthorisationService";"Login";"Logging in ggreen user";0;"2019-03-19 14:52:54";"";"2019-03-19 14:52:54";"2019-03-19 14:52:54"')])}}, {'index': {'_index': 'audits', '_type': 'audit', '_id': 'mHT-9m8BOz4WhPcWVe2O', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Rejecting mapping update to [audits] as the final mapping would have more than 1 type: [_doc, audit]'}, 'data': OrderedDict([('id;"module";"action";"description";"level";"occured_at";"created_user";"createdAt";"updatedAt"', '0044084a-a749-4aa6-af05-4deba6a38432;"UserService";"UnBlock";"UnBlock user: ppurple";0;"2019-03-19 15:37:31";"ad223ccd-b6f9-4bd8-868d-4fae2e95d9a2";"2019-03-19 15:37:31";"2019-03-19 15:37:31"')])}}])
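
For what it's worth, the settings file above uses the typeless (7.x style) mapping layout, so the index is created with the implicit _doc type, and the subsequent bulk with --type audit then clashes with it. On a 6.x cluster, one option (a sketch, assuming the cluster still supports custom types) is to nest the properties under the type name used on the command line; alternatively keep the typeless mapping and run with --type _doc. Note also that the 'data' payload in the error shows the whole semicolon-separated header being treated as a single field, so the csv command probably also needs --delimiter ";".

{
  "settings": {},
  "mappings": {
    "audit": {
      "properties": {
        "@timestamp": {"type": "date"},
        "module": {"type": "keyword"},
        "action": {"type": "keyword"},
        "occured_at": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"}
      }
    }
  }
}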

Unable to load parquet files into local Elasticsearch

So I've just installed elasticsearch-loader according to all the steps provided in docs, and also elasticsearch-loader[parquet].

However, whenever I try to load a parquet file into my local instance, the following error appears:

Traceback (most recent call last):
  File "/usr/local/bin/elasticsearch_loader", line 10, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch_loader/__init__.py", line 159, in _parquet
    load(lines, ctx.obj)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch_loader/__init__.py", line 51, in load
    for i, bulk in enumerate(pbar):
  File "/usr/local/lib/python2.7/site-packages/click/_termui_impl.py", line 285, in generator
    for rv in self.iter:
  File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 420, in DictReader
    for row in reader(file_obj, columns):
  File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 438, in reader
    schema_helper = schema.SchemaHelper(footer.schema)
  File "/usr/local/lib/python2.7/site-packages/parquet/schema.py", line 24, in __init__
    assert len(self.schema_elements) == len(self.schema_elements_by_name)
AssertionError

Any clue on why this could happen?

Error: no such option: --index-settings-file

When the command below is run, it throws the error: "Error: no such option: --index-settings-file"

elasticsearch_loader --index incidents --type incident parquet /tmp/part-00000-4078ca87-b779-4960-9b15-e61ca69d2ece-c000.snappy.parquet --index_settings_file /tmp/settings.json
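
For comparison, other reports spell the option --index-settings-file (dashes, not underscores) and pass it before the subcommand, since it is an option of the top-level command; a sketch of the reordered invocation, assuming the installed version supports the option at all:

elasticsearch_loader --index-settings-file /tmp/settings.json --index incidents --type incident parquet /tmp/part-00000-4078ca87-b779-4960-9b15-e61ca69d2ece-c000.snappy.parquet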

Numeric Data upload using custom mapping file

@moshe I'd like to write my own custom mapping file to upload numeric data with relevant fields for me to visualize in Kibana. Could you possibly show me an example of a mapping file that I'd have to reference with the --index-settings-file so that I can do the same?

Unable to upload tsv file properly

When I try to load a TSV (exported from Excel as 'Unicode Text'), it imports the whole header row as one field key and then each row as the field value. It doesn't appear to separate fields on the tab character.

Cannot execute on a mac

I was able to install this on a Mac, but I'm unable to run it. It returns the error below. Any thoughts? Is this platform dependent? Is it only tested on Linux?

Traceback (most recent call last):
  File "/Users/jenkins/Library/Python/3.6/bin/elasticsearch_loader", line 11, in <module>
    load_entry_point('elasticsearch-loader==0.2.2', 'console_scripts', 'elasticsearch_loader')()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkg_resources/__init__.py", line 565, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2631, in load_entry_point
    return ep.load()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2291, in load
    return self.resolve()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2297, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/Users/jenkins/Library/Python/3.6/lib/python/site-packages/elasticsearch_loader/__init__.py", line 8, in <module>
    from click_conf import conf
ModuleNotFoundError: No module named 'click_conf'

ReadTimeoutError

Is there any way to increase the timeout? I got the following timeout on the initial index load with a settings configuration.

Thanks

raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.3.54.42', port=9200): Read timed out. (read timeout=10)
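
The config dumps in other issues show a timeout key defaulting to 10.0, which matches the read timeout=10 above, so presumably it can be raised from the command line; a sketch (the --timeout flag name is inferred from that config key, so treat it as an assumption, and the other arguments are illustrative):

elasticsearch_loader --timeout 120 --index myindex --index-settings-file settings.json json data.json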

Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes

Hi.

I'm trying to use Elasticsearch Head (chrome plugin) to query an existing elasticsearch index. I'm writing out the results in either CSV or JSON to a file.

I'm then trying to use elasticsearch_loader to load the data into another elasticsearch instance.

Every time I try, however, I get the following error:

'status': 400, 'error': {'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': {'type': 'not_x_content_exception', 'reason': 'Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes'}}, 'data': 'index'}}

Is it possible (using elasticsearch_loader) to export from one Elasticsearch instance (using something like Elasticsearch Head) and import it into another instance without having to change the format/data of the export?

Thanks

TSV import : TypeError: "delimiter" must be a 1-character string

I use CentOS Linux release 7.6.1810 with python 3.6 (python36-3.6.8-1.el7.x86_64 / python36-pip-8.1.2-8.el7.noarch). elasticsearch_loader was installed with pip (pip3.6 install elasticsearch_loader) with no errors.

On importing the TSV into Elasticsearch I got:
elasticsearch_loader --type poland_geonames --index poland_geonames csv --delimiter="\t" poland_100.tsv
{'type': 'poland_geonames', 'index': 'poland_geonames', 'bulk_size': 500, 'es_host': 'http://localhost:9200', 'verify_certs': False, 'use_ssl': False, 'ca_certs': None, 'http_auth': None, 'delete': False, 'update': False, 'progress': False, 'id_field': None, 'as_child': False, 'with_retry': False, 'index_settings_file': None, 'timeout': 10.0, 'encoding': 'utf-8', 'keys': [], 'es_conn': <Elasticsearch([{'host': 'localhost', 'port': 9200}])>}
Traceback (most recent call last):
  File "/usr/local/bin/elasticsearch_loader", line 11, in <module>
    load_entry_point('elasticsearch-loader==0.4.0', 'console_scripts', 'elasticsearch_loader')()
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/elasticsearch_loader/__init__.py", line 130, in _csv
    lines = chain(*(csv.DictReader(x, delimiter=str(delimiter)) for x in files))
  File "/usr/local/lib/python3.6/site-packages/elasticsearch_loader/__init__.py", line 130, in <genexpr>
    lines = chain(*(csv.DictReader(x, delimiter=str(delimiter)) for x in files))
  File "/usr/lib64/python3.6/csv.py", line 87, in __init__
    self.reader = reader(f, dialect, *args, **kwds)
TypeError: "delimiter" must be a 1-character string

Sample file:
sample_file.tsv.txt

Greetings!!!
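
The quoted "\t" reaches Python as a two-character backslash-t string, which csv.DictReader rejects; passing a real tab character (for example with bash's ANSI-C quoting) is one way around it. A sketch of the invocation (same index and type as above):

elasticsearch_loader --type poland_geonames --index poland_geonames csv --delimiter=$'\t' poland_100.tsv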


406, u'Content-Type header [] is not supported

C:\Users\1234\Desktop\elasticsearch_loader-master\elasticsearch_loader-master>elasticsearch_loader --es-host http://192.168.1.139:9200 --index index --type index csv telekom_sirovi_podaci.csv
2018-08-07 20:30:55.266000 INFO Loading into ElasticSearch
[------------------------------------]2018-08-07 20:30:55.319000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.320000 WARN Chunk 0 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing
2018-08-07 20:30:55.360000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.361000 WARN Chunk 1 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing
2018-08-07 20:30:55.399000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.402000 WARN Chunk 2 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing
2018-08-07 20:30:55.443000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.444000 WARN Chunk 3 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing
2018-08-07 20:30:55.483000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.486000 WARN Chunk 4 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing
2018-08-07 20:30:55.525000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.525000 WARN Chunk 5 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing
2018-08-07 20:30:55.566000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.569000 WARN Chunk 6 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing
2018-08-07 20:30:55.609000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.610000 WARN Chunk 7 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing
2018-08-07 20:30:55.649000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.652000 WARN Chunk 8 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing
2018-08-07 20:30:55.693000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.693000 WARN Chunk 9 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing
2018-08-07 20:30:55.733000 ERROR attempt [1/1] got exception, it is a permanent data loss, no retry any more
2018-08-07 20:30:55.736000 WARN Chunk 10 got exception (TransportError(406, u'Content-Type header [] is not supported')) while processing

ES now requires the Content-Type header to be specified:

https://www.elastic.co/blog/strict-content-type-checking-for-elasticsearch-rest-requests

elasticsearch.exceptions.SSLError

Hi!

Even without forcing certificate verification (and even with a valid certificate), I'm getting this error when connecting:

elasticsearch.exceptions.SSLError: ConnectionError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1056)) caused by: SSLError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1056))

Also, even though the es_host parameter specifies https as the protocol, the scheme parameter in the es_conn JSON says http at initialization (I don't know if this has something to do with it):

{'index': '[REDACTED]', 'es_host': ('https://[REDACTED],), 'http_auth': '[REDACTED]', 'bulk_size': 500, 'verify_certs': False, 'use_ssl': True, 'ca_certs': None, 'delete': False, 'update': False, 'progress': False, 'type': '_doc', 'id_field': None, 'as_child': False, 'with_retry': False, 'index_settings_file': None, 'timeout': 10.0, 'encoding': 'utf-8', 'keys': [], 'es_conn': <Elasticsearch([{'host': '[REDACTED]', 'port': [REDACTED], 'use_ssl': True, 'scheme': 'http'}])>}

It looks like it is a well known problem with the library that elasticsearch_loader is using: https://stackoverflow.com/questions/54454126/set-verify-certs-false-yet-elasticsearch-elasticsearch-throws-ssl-error-for-cert

Cheers!

"More than one type" error

Dear Moshe,
I am using elasticsearch_loader with Elasticsearch 7.1.1 on Elastic Cloud. When I add my own mapping to the index and then use the loader, the following error is displayed:

"Rejecting mapping update to [myindex] as the final mapping would have more than 1 type: [_doc, json]', u'type': u'illegal_argument_exception'"

Tracing the issue I believe this change is the cause: https://www.elastic.co/blog/index-type-parent-child-join-now-future-in-elasticsearch

I was able to work around it by commenting out this line:

'_type': config['type'],

But I believe there might be a better solution, which you might be able to propose.

Thank you for the helpful software!
Peio

Empty string in date field returns error

If I have an empty date in my CSV it causes the upload to crash.

failed to parse field [Book_Date] of type [date] in document with id 's97vIHABXxdv5IBd1aOp'

Field value

'Book_Date': '', 
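
A workaround outside the tool is to drop empty date cells from each row before loading, so the field is simply absent for those documents instead of an unparseable empty string; a rough sketch, with the field name taken from the error above and the file names made up:

# Sketch: strip empty date values from CSV rows, then load the cleaned JSON
# with the json subcommand afterwards.
import csv
import json

DATE_FIELDS = {"Book_Date"}  # field name from the error message

with open("books.csv", newline="") as src, open("books_clean.json", "w") as dst:
    docs = []
    for row in csv.DictReader(src):
        for field in DATE_FIELDS:
            if not row.get(field, "").strip():
                row.pop(field, None)  # drop the empty date instead of sending ""
        docs.append(row)
    json.dump(docs, dst)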
