
scrapy-elasticsearch's Introduction

Description

Scrapy pipeline which allows you to store Scrapy items in Elasticsearch.

Install

pip install ScrapyElasticSearch

If you need support for ntlm:
pip install "ScrapyElasticSearch[extras]"

Usage (configure settings.py)

ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500
}

ELASTICSEARCH_SERVERS = ['localhost']
ELASTICSEARCH_INDEX = 'scrapy'
ELASTICSEARCH_INDEX_DATE_FORMAT = '%Y-%m'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'  # Custom unique key

# Can also accept a list of fields if you need a composite key
ELASTICSEARCH_UNIQ_KEY = ['url', 'id']

ELASTICSEARCH_SERVERS - list of hosts or string (single host). Host format: protocol://username:password@host:port.

Examples:
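(the hosts and credentials below are placeholder values)

ELASTICSEARCH_SERVERS = 'localhost'                                                        # single host as a string
ELASTICSEARCH_SERVERS = ['https://es1.example.com:9200', 'https://es2.example.com:9200']   # multiple nodes
ELASTICSEARCH_SERVERS = ['http://user:secret@es1.example.com:9200']                        # credentials in the URL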

Available parameters (in settings.py)

 ELASTICSEARCH_INDEX - Elasticsearch index name
 ELASTICSEARCH_INDEX_DATE_FORMAT - the format for the date suffix appended to the index name; see Python's datetime.strftime for the format codes. Default is no date suffix.
 ELASTICSEARCH_TYPE - Elasticsearch type
 ELASTICSEARCH_UNIQ_KEY - optional field, unique key as a string or a list of fields (must be fields declared in the item model, see items.py)
 ELASTICSEARCH_BUFFER_LENGTH - optional field, number of items to buffer before each bulk insertion into Elasticsearch. Default is 500.
 ELASTICSEARCH_AUTH - optional field, set to 'NTLM' to use NTLM authentication (an example follows the CA example below)
 ELASTICSEARCH_USERNAME - optional field, set to 'DOMAIN\username', only used with NTLM authentication
 ELASTICSEARCH_PASSWORD - optional field, set to your password, only used with NTLM authentication

 ELASTICSEARCH_CA - optional setting for when the Elasticsearch servers require custom CA or client certificate files.
 Example:
 ELASTICSEARCH_CA = {
      'CA_CERT': '/path/to/cacert.pem',
      'CLIENT_CERT': '/path/to/client_cert.pem',
      'CLIENT_KEY': '/path/to/client_key.pem'
}
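
 For NTLM authentication, a minimal illustrative configuration (the domain, username and password are placeholders; note the escaped backslash in the Python string):
 ELASTICSEARCH_AUTH = 'NTLM'
 ELASTICSEARCH_USERNAME = 'MYDOMAIN\\myuser'
 ELASTICSEARCH_PASSWORD = 'secret'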

Here is an example app (dirbot: https://github.com/jayzeng/dirbot) in case you are still unsure how to wire things up.

Dependencies

See requirements.txt

Changelog

  • 0.9: Accept custom CA cert to connect to es clusters
  • 0.8: Added support for NTLM authentication
  • 0.7.1: Added date format to the index name and a small bug fix
    • ELASTICSEARCH_BUFFER_LENGTH default was 9999; this has been changed to match the documentation.
  • 0.7: A number of backwards incompatibility changes are introduced:
    • Changed ELASTICSEARCH_SERVER to ELASTICSEARCH_SERVERS
    • ELASTICSEARCH_SERVERS accepts string or list
    • ELASTICSEARCH_PORT is removed, you can specify it in the url
    • ELASTICSEARCH_USERNAME and ELASTICSEARCH_PASSWORD are removed. You can use this format ELASTICSEARCH_SERVERS=['http://username:password@host:port']
    • Changed scrapy.log to logging as scrapy now uses the logging module
  • 0.6.1: Able to pull configs from spiders (in addition to reading from config file)
  • 0.6: Bug fix
  • 0.5: Ability to persist objects; option to specify logging level
  • 0.4: Remove debug
  • 0.3: Auth support
  • 0.2: Scrapy 0.18 support
  • 0.1: Initial release

Issues

If you find any bugs or have any questions, please report them on the issue tracker (https://github.com/knockrentals/scrapy-elasticsearch/issues).

Contributors

alukach, andskli, aniketmaithani, denizdogan, ignaciovazquez, jayzeng, jenkin, jsgervais, julien-duponchelle, lljrsr, mjm159, phrawzty, ppaci, sajattack, songzhiyong, tpeng

Licence

Copyright 2014 Michael Malocha

Expanded on the work by Julien Duponchelle

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

scrapy-elasticsearch's Issues

Time-based indices' name from scraped data

The parameter ELASTICSEARCH_INDEX_DATE_FORMAT sets the index suffix from the scraping timestamp using the specified format (e.g. -%Y%m%d). But I need to set it from a string or a datetime (in a given format) taken from the scraped item. Here is a simple solution with two more parameters (ELASTICSEARCH_INDEX_DATE_KEY and ELASTICSEARCH_INDEX_DATE_KEY_FORMAT): jenkin@e834082.
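Until something like that is merged, one workaround is a small custom pipeline that computes the index name from the item itself instead of using this package's index setting. A minimal sketch, assuming a 'published_at' field in '%Y-%m-%d' format and a 'scrapy-' index prefix (these names are illustrative, not part of scrapy-elasticsearch):

from datetime import datetime
from elasticsearch import Elasticsearch

class DateKeyedElasticsearchPipeline(object):
    def open_spider(self, spider):
        # Connect using the same server setting the stock pipeline reads.
        self.es = Elasticsearch(spider.settings.get('ELASTICSEARCH_SERVERS', ['localhost']))

    def process_item(self, item, spider):
        # Build the index suffix from the item's own date field.
        published = datetime.strptime(item['published_at'], '%Y-%m-%d')
        index_name = 'scrapy-%s' % published.strftime('%Y-%m')
        self.es.index(index=index_name, doc_type='items', body=dict(item))
        return item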

'set' object has no attribute 'iteritems'

Something has gone wrong with my scrapy elasticsearch pipeline. If I leave the pipeline as active in my settings, it returns an AttributeError (see attached). However, if I comment the pipeline out, the script runs without issue. Thoughts?

(screenshots attached: "set object error", "set object error settings")

Suggestion of setting '_index', '_source' and other parameters directly in parser

Hi,
I want to suggest changing how the pipeline operates so that the items to be indexed are created by the user at the parser level, rather than by the pipeline via the 'ELASTICSEARCH_INDEX' and 'ELASTICSEARCH_TYPE' parameters.
Advantages:
- The user can specify different Elasticsearch indices for different parsers
- The user can control the '_op_type' setting of the bulk method, for example to change from 'index' to 'update'

Cheers,
Julian

AttributeError: module 'types' has no attribute 'ListType'

After adding ElasticSearchPipeline to my ITEM_PIPELINES array I see this error:

Traceback (most recent call last):
  File "/home/spl/Code/python_env/myenv/lib/python3.5/site-packages/twisted/internet/defer.py", line 587, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/spl/Code/python_env/myenv/lib/python3.5/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 108, in process_item
    if isinstance(item, types.GeneratorType) or isinstance(item, types.ListType):
AttributeError: module 'types' has no attribute 'ListType'

All involved packages are installed in the most recent versions.
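types.ListType only exists in Python 2; under Python 3 the check needs to test against the built-in list. A minimal sketch of a version-independent test (the function name here is just for illustration; the package's own check lives inside process_item):

import types

def is_multi_item(item):
    # 'list' replaces the removed types.ListType; generators are unchanged.
    return isinstance(item, types.GeneratorType) or isinstance(item, list)

Upgrading to a release of the package with Python 3 support may also resolve this.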

BulkIndexError

BulkIndexError: (u'1 document(s) failed to index.', [{u'create': {u'status': 400, u'_type': u'jd_comment_test', u'_index': u'jd_comment-2018-05-26', u'error': {u'reason': u'Field [_id] is defined twice in [jd_comment_test]', u'type': u'illegal_argument_exception'}

How can I solve this problem?

Elasticsearch pipeline not enabled - Scrapy 1.3.3 / ES 5.2

Hi,

I’m trying to integrate Elasticsearch with Scrapy. I’ve followed the steps from https://github.com/knockrentals/scrapy-elasticsearch,
but it’s not loading the pipeline. I’m using Scrapy 1.3.3 with Elasticsearch 5.2.

Logging:
INFO: Enabled item pipelines: []

My settings.py is as follows:

ITEM_PIPELINES = {
'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500
}

ELASTICSEARCH_SERVERS = ['http://172.17.0.2:9200']
ELASTICSEARCH_INDEX = 'scrapy'
#ELASTICSEARCH_INDEX_DATE_FORMAT = '%Y-%m'
ELASTICSEARCH_INDEX_DATE_FORMAT = '%A %d %B %Y'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url' # Custom unique key

Am I missing something? Do you need to define the Pipeline in pipelines.py?
The “dirbot” example didn’t.

ImportError: No module named requests_ntlm

@jayzeng Thanks for your effort. I have done exactly what you instructed and installed ScrapyElasticSearch 0.8.3.
But now I get this error:
.....
File "C:\Miniconda2\lib\site-packages\scrapy\utils\misc.py", line 44, in load_
object
mod = import_module(module)
File "C:\Miniconda2\lib\importlib__init__.py", line 37, in import_module
import(name)
File "C:\Miniconda2\lib\site-packages\scrapyelasticsearch\scrapyelasticsearch.
py", line 21, in
from .transportNTLM import TransportNTLM
File "C:\Miniconda2\lib\site-packages\scrapyelasticsearch\transportNTLM.py", l
ine 7, in
from requests_ntlm import HttpNtlmAuth
ImportError: No module named requests_ntlm

Specify specific fields to index

I am also storing the raw HTML along with the items, but do not want to send that to the ES index. Can we specify which fields should be sent to ES for indexing?
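One approach that doesn't require changes to this package is an extra pipeline, ordered before ElasticSearchPipeline, that strips the fields you don't want indexed. A minimal sketch, assuming a hypothetical 'raw_html' field and a placeholder 'myproject.pipelines' module path:

class StripFieldsPipeline(object):
    # Fields to drop before the item reaches the Elasticsearch pipeline.
    EXCLUDED_FIELDS = ['raw_html']

    def process_item(self, item, spider):
        for field in self.EXCLUDED_FIELDS:
            item.pop(field, None)  # works for dicts and dict-like scrapy Items
        return item

ITEM_PIPELINES = {
    'myproject.pipelines.StripFieldsPipeline': 400,
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500,
}

Note that this removes the field for every later pipeline and exporter, so order ITEM_PIPELINES accordingly if something else still needs the raw HTML.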

ImportError: No module named 'transportNTLM'

I successfully installed it according to the document https://pypi.python.org/pypi/ScrapyElasticSearch. However, when I try

scrapy crawl myspider

in the Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)] on win32, i get this error:

File "E:\Python35\lib\site-packages\scrapyelasticsearch\scrapyelasticsearch.py", line 25, in <module> from transportNTLM import TransportNTLM ImportError: No module named 'transportNTLM'

I checked in the folder and the transportNTLM.py module is there.
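For what it's worth, the NTLM traceback earlier on this page (from 0.8.3) shows the package-relative form of this import, which is what Python 3 requires:

from .transportNTLM import TransportNTLM

So upgrading to a release that uses the relative import, or patching that line locally, should get past this error (assuming requests_ntlm itself is installed).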

Text field always gets ignore_above keyword

Hi,

Every time I save a text field to ES, the mapping has the following structure:

"text": {
                  "type": "text",
                  "fields": {
                     "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                     }
                  }
               }

Meaning I cannot search for any text that appears after 256 characters.

Is there any way of avoiding this? Thanks very much in advance!
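That structure comes from Elasticsearch's default dynamic mapping, not from this pipeline: the 'text' part of the field is still analyzed and searchable with full-text queries, and ignore_above only limits the auto-generated '.keyword' sub-field used for exact matches and aggregations. If you want different behaviour, one option is to create the index with an explicit mapping before crawling. A minimal sketch using elasticsearch-py, where the index name 'scrapy', the type 'items' and the field name are assumptions matching the settings above:

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
# Create the index up front so dynamic mapping never adds the keyword sub-field.
es.indices.create(index='scrapy', body={
    'mappings': {
        'items': {
            'properties': {
                'text': {'type': 'text'}
            }
        }
    }
})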

Dynamically set on index/type?

Configuration for Elasticsearch index and type is done statically in settings.py. Is there a recommended approach to setting this during runtime, perhaps based on the item that is being piped?
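Since 0.6.1 the pipeline can pull its configuration from the spider as well as from settings.py, so per-spider (though not per-item) index and type are possible with Scrapy's custom_settings, assuming the pipeline reads its settings through the crawler. A minimal sketch with placeholder names:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    # Overrides merged on top of settings.py for this spider only.
    custom_settings = {
        'ELASTICSEARCH_INDEX': 'products',
        'ELASTICSEARCH_TYPE': 'product',
    }

For truly per-item routing you would need a custom pipeline that derives the index from the item, as sketched under the time-based indices issue above.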

How to handle deleted documents?

Example: I'm crawling an API with JSON documents. From time to time some documents get removed from the source database and are missing in the API. How to handle this with scrapy-elasticsearch to keep es up to date?

From the source code, I can see that there's no _op_type parameter specified in the bulk call so it probably resorts to default 'index'.

add index mapping

Hi!
How could I define a mapping for each field?
I want all my index fields to be mapped as "not_analyzed" so I can get exact values in search results.

Thank you,

Unique key is tuple if using items

If I am using items for parsing scrapy responses, the unique key when retrieved from the item is a tuple, and process_unique_key() will raise an exception. This can easily be fixed by changing line 94 in scrapyelasticsearch.py

From
if isinstance(unique_key, list):
To
if isinstance(unique_key, (list,tuple)):

got an unexpected keyword argument 'headers'

Hello, I'm trying to insert data into the Bonsai.io ES cloud and getting this error:

File "/usr/local/lib/python3.5/dist-packages/elasticsearch/client/init.py", line 1155, in bulk
headers={'content-type': 'application/x-ndjson'})
TypeError: perform_request() got an unexpected keyword argument 'headers'

How can I solve it?

Thanks

Removal of _type requirement

We have recently upgraded to Elasticsearch 6.2.x, which does not require a type. Is there a way to remove the requirement for ELASTICSEARCH_TYPE in the code?

Add optional date suffix to index name

With Elasticsearch it's common practice to add a date suffix to an index. If your index name is "test", there should be a way to automatically create monthly indexes (example: test-2016-6, test-2016-7, etc...).

I am willing to submit a Pull Request to this repo with the changes I made to my local copy, just need permission.

Is it possible to update an item if an item with the same id exists?

Is it possible to update an item if an item with the same id already exists in Elasticsearch, instead of adding a new one? I mean:
{
itemId: 1,
color: ['red', 'blue']
}
{
itemId: 1,
color: ['green']
}

result:
{
itemId: 1,
color: ['red', 'blue', 'green']
}
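As noted in another issue below, the pipeline bulk-indexes with the default 'index' op type, so a document with the same _id is overwritten rather than merged. Merging lists like this needs an update with a script; a hedged sketch using elasticsearch-py directly (the index, type and field names are assumptions, and on ES 5.x the script key may be 'inline' instead of 'source'):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
# Append new colors to an existing document, or create it if missing.
es.update(index='scrapy', doc_type='items', id='1', body={
    'script': {
        'lang': 'painless',
        'source': 'ctx._source.color.addAll(params.new_colors)',
        'params': {'new_colors': ['green']},
    },
    'upsert': {'itemId': 1, 'color': ['green']},
})

Note this simple script appends without de-duplicating; avoiding repeated values needs a slightly longer script.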

Scrapy logging show UnicodeDecodeError

Hello,

How can I solve issues related to encoding/decoding? Below is the traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 159, in close_spider
    self.send_items()
  File "/usr/local/lib/python2.7/dist-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 146, in send_items
    helpers.bulk(self.es, self.items_buffer)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/actions.py", line 304, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/actions.py", line 216, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/actions.py", line 75, in _chunk_actions
    cur_size += len(data.encode("utf-8")) + 1
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 58: ordinal not in range(128)

Thanks,

Include S3 images field

I'm using the S3 feed export to download images during scraping. I'm able to download the images to my S3 bucket, but that happens after scrapy-elasticsearch has already completed. How can I include the image's S3 URL along with my item to be indexed?

Example:
item['thumbnail'] = 'https://s3-eu-west-1.amazonaws.com/image-url-response-from-s3

ELASTICSEARCH_UNIQ_KEY from multiple item fields

You can have an item without a single-field primary key, so this functionality is useless (or even dangerous!). Sure, you can compute and add a new, really unique field to the item, e.g. from a concatenation of fields, but then it will be indexed along with the others. Maybe ELASTICSEARCH_UNIQ_KEY could accept a list of field keys and concatenate their values (forced to strings) before computing the hash.

NTLM

I needed to do a:

pip install requests_ntlm

for this pipeline to work

How to insert data into an existing index ?

Hi there,

This plugin works great with the latest version of Scrapy (1.3) and Elasticsearch (5.1.1) on Ubuntu 16. Great work, Thanks.

There is a little problem. I have already setup an 'index' and 'mappings' in Elasticsearch. How do I configure this plugin to insert data in that existing index rather than creating a new one?

I used these settings (where the 'news' index and the 'allNews' type already exist). The following settings create a new index called "news-2017-01" and insert all the data into that index. I don't want that. I want this plugin to populate an already existing index. How do I do that?

ELASTICSEARCH_SERVERS = ['localhost']
ELASTICSEARCH_INDEX = 'news'
ELASTICSEARCH_INDEX_DATE_FORMAT = '%Y-%m'
ELASTICSEARCH_TYPE = 'allNews'
#ELASTICSEARCH_UNIQ_KEY = 'url' # Custom unique key

Please help.
Thanks
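The "-2017-01" suffix comes from ELASTICSEARCH_INDEX_DATE_FORMAT; with that setting removed the pipeline writes straight to the existing index, since the default is no date suffix. Something like:

ELASTICSEARCH_SERVERS = ['localhost']
ELASTICSEARCH_INDEX = 'news'
# no ELASTICSEARCH_INDEX_DATE_FORMAT, so no date suffix is appended
ELASTICSEARCH_TYPE = 'allNews'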

Add support for multiple ES nodes

In settings you can set ELASTICSEARCH_SERVER only as a simple string. Pyes supports multiple hosts in ES initialization, and this string is passed wrapped in a single-element list (see L50 in scrapyelasticsearch.py).

It could be useful to set ELASTICSEARCH_SERVER also as a list of hosts, maybe checking whether the setting is actually a string or a list, e.g. via isinstance().

Bulk indexing instead of single item indexing

Working as a pipeline, every item is indexed separately, with many requests to ES (one per item). In addition, in some cases you want to break the pipeline concept and apply a global transformation to items before indexing (e.g. by overloading the open_spider and close_spider methods in a pipeline class).

Using the ES bulk API you could temporarily add items to an item buffer (with a length controlled by a setting) and then index them in batches instead of one request per item.

[SUGGESTION] Tackling MemoryError raised on bulk inserts

Good work on the extension, and I appreciate the help being provided to the community. Though I haven't used your extension in production, the extension I wrote is very similar to the code you have.

I wanted to bring to your attention an issue I faced personally which may improve your extension. With long-running scrapers (we have had scrapers run for 20 hrs sometimes) it is possible that your machine will run out of memory if all the items are appended to the items_buffer like so. My scrapers have failed to insert items after raising a MemoryError. The workaround I use is to set a max_insert_limit_counter in the extension class and bulk insert items into ES once the max limit is hit. Somebody who uses this extension will probably run into this issue in the future.

If you would like me to create a PR for this, let me know.

Python 3 issue with hashlib.sha1() for unique ID?

I am currently migrating from Python 2 to 3, so I'm not sure if this is a valid issue.

When I configured my ELASTICSEARCH_UNIQ_KEY value, I ran into a problem - if my unique ID is str, hashlib.sha1() complains:

TypeError: Unicode-objects must be encoded before hashing

If I .encode('utf-8') the ID before putting it in the field, line 73 in scrapyelasticsearch.py complains 'unique key must be str'

To work around it, I have to put the ID in a list!

What's the purpose of if isinstance(unique_key, list): in the def get_unique_key(self, unique_key) method?
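Python 3's hashlib.sha1() only accepts bytes, so a str key has to be encoded first; the list branch exists because ELASTICSEARCH_UNIQ_KEY may be a list of fields forming a composite key. A minimal sketch of the kind of normalisation involved (exactly how the package joins composite keys is an assumption here):

import hashlib

def hash_unique_key(unique_key):
    # Composite keys: join the parts into one string first.
    if isinstance(unique_key, (list, tuple)):
        unique_key = '-'.join(str(part) for part in unique_key)
    # Python 3: sha1() needs bytes, so encode str values.
    if isinstance(unique_key, str):
        unique_key = unique_key.encode('utf-8')
    return hashlib.sha1(unique_key).hexdigest()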

Elasticsearch not receiving data from scrapy

I had set up Scrapy on my local machine with a CrawlSpider to index a local static HTML site. So far so good, I get a valid JSON file as output.
Next I installed ScrapyElasticSearch, configured settings.py with the correct ITEM_PIPELINES and ran scrapy crawl on my site.
If I look at the logs, I get this:

2017-08-01 13:25:56 [root] DEBUG: Generated unique key bbd9eba5e56d510757eb42eed3b130520b7b1958
2017-08-01 13:25:56 [root] DEBUG: Item sent to Elastic Search scrapy

But when I look at my Elasticsearch server, no data has been entered. Even worse, I can just shutdown my Elasticsearch engine and the log entry will still say the same. So no error message is thrown.
I've also tested this in a clean vagrant machine with virtualenv enabled, but the problem is the same. I tried logging network traffic with tcpdump, not a single byte is passed. I have no clue what I did wrong, other than that something is broken.

Below is my pip list:

argparse (1.2.1)
asn1crypto (0.22.0)
attrs (17.2.0)
Automat (0.6.0)
cffi (1.10.0)
constantly (15.1.0)
cryptography (2.0.2)
cssselect (1.0.1)
elasticsearch (5.4.0)
enum34 (1.1.6)
hyperlink (17.3.0)
idna (2.5)
incremental (17.5.0)
ipaddress (1.0.18)
lxml (3.8.0)
parsel (1.2.0)
pip (1.5.6)
pyasn1 (0.3.1)
pyasn1-modules (0.0.10)
pycparser (2.18)
PyDispatcher (2.0.5)
pyOpenSSL (17.2.0)
queuelib (1.4.2)
Scrapy (1.4.0)
ScrapyElasticSearch (0.8.9)
service-identity (17.0.0)
setuptools (5.5.1)
six (1.10.0)
Twisted (17.5.0)
urllib3 (1.22)
w3lib (1.17.0)
wsgiref (0.1.2)
zope.interface (4.4.2)

My relevant settings.py entries:

ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 100
}

ELASTICSEARCH_SERVERS = ['http://127.0.0.1:9200']
ELASTICSEARCH_INDEX = 'scrapy'
ELASTICSEARCH_INDEX_DATE_FORMAT = '%Y-%m'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'

Thank you for your help

ELASTICSEARCH_BUFFER_LENGTH

I use scrapy-redis, and my spider waits for input from the redis queue.
If I send fewer URLs than the buffer length, they won't ever be pushed into Elasticsearch.

Do you have any workarounds?
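Until the buffer is also flushed when the spider goes idle, one blunt workaround is to shrink the documented buffer setting so items are sent (almost) immediately, at the cost of more requests to Elasticsearch:

ELASTICSEARCH_BUFFER_LENGTH = 1  # flush after every item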

Lots of items missing in kibana when using elasticsearch pipeline, but available in csv export and another relational pipeline

I would like to know how to debug my situation: I have a PostgreSQL pipeline that's working flawlessly, adding 2k items to my relational database when I run Scrapy. I've installed scrapy-elasticsearch as well to be able to use Elasticsearch alongside PostgreSQL, but after scraping, when I get into Kibana, I have... 36 items. My index is the day the item was scraped, and even selecting "years ago" in the Kibana interface I only get 36 hits.

How and where do I debug to check where are things going wrong?

Content-Type required, Elasticsearch 6.x

Hi,

I'm testing this plugin with the new Elasticsearch 6.x version.
The Content-Type (json) header is now required. I get the following error:

[elasticsearch] DEBUG: < {"error":"Content-Type header [] is not supported","status":406}

Is there a way to set the content-type to json ?

Thanks !

Missing header information for ElasticSearch 6.2

Thank you all for putting together this great tool. I was thrilled to find this.

I am currently getting an error as follows:

{"error":"Content-Type header [] is not supported","status":406}

According to elasticsearch-dump/elasticsearch-dump#350, additional headers need to be passed for Elasticsearch 6.x, as follows:

-headers='{"Content-Type": "application/json"}

Could this be added as a new configuration option?

TypeError: sha1() argument 1 must be string or buffer, not list

Hi,

I have the following issue when running the spider below, before the item is added to ES.

The ES Key is set as "link".

Any help would be greatly appreciated.

import scrapy
import uuid

from compass.items import CompassItem

class DarkReadingSpider(scrapy.Spider):
    name = "darkreading"
    allowed_domains = ["darkreading.com"]
    start_urls = (
        'http://www.darkreading.com/rss_simple.asp',
    )

    def parse(self, response):
        for sel in response.xpath('//item'):
                item = CompassItem()
                item['id'] = uuid.uuid4()
                item['title'] = sel.xpath('title/text()').extract()
                item['link'] = sel.xpath('link/text()').extract()
                item['desc'] = sel.xpath('description/text()').extract()
                print item
                yield item

Output/Error:

{'desc': [u'Two-thirds of IT security professionals say that network security has become more difficult over the last two years with growing complexity in managing heterogeneous network environments.'],
'id': UUID('0112e36e-50ce-4660-9072-da2a5fec09e6'),
'link': [u'http://www.darkreading.com/cloud/survey-shows-cloud-infrastructure-security-a-major-challenge-/d/d-id/1324872?_mc=RSS_DR_EDT'],
'title': [u'Survey Shows Cloud Infrastructure Security A Major Challenge ']}
2016-04-01 15:15:34 [scrapy] ERROR: Error processing {'desc': [u'Two-thirds of IT security professionals say that network security has become more difficult over the last two years with growing complexity in managing heterogeneous network environments.'],
'id': UUID('0112e36e-50ce-4660-9072-da2a5fec09e6'),
'link': [u'http://www.darkreading.com/cloud/survey-shows-cloud-infrastructure-security-a-major-challenge-/d/d-id/1324872?_mc=RSS_DR_EDT'],
'title': [u'Survey Shows Cloud Infrastructure Security A Major Challenge ']}
Traceback (most recent call last):
  File "/usr/local/lib64/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 70, in process_item
    self.index_item(item)
  File "/usr/local/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 52, in index_item
    local_id = hashlib.sha1(item[uniq_key]).hexdigest()
TypeError: sha1() argument 1 must be string or buffer, not list
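The unique key ('link') is hashed with sha1(), and xpath(...).extract() returns a list, hence the error. Extracting a single string for each field avoids it. A minimal sketch of the parse method with that change (extract_first() is available in Scrapy 1.x, and the UUID is stringified for the same reason):

    def parse(self, response):
        for sel in response.xpath('//item'):
            item = CompassItem()
            item['id'] = str(uuid.uuid4())
            item['title'] = sel.xpath('title/text()').extract_first()
            # The ES unique key must be a plain string, not a list.
            item['link'] = sel.xpath('link/text()').extract_first()
            item['desc'] = sel.xpath('description/text()').extract_first()
            yield item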

What does 'ELASTICSEARCH_UNIQ_KEY' do?

Is it for setting the _id value? I added some documents and it didn't seem like that. If so, is there any way to set the id value? I want to use the scraped URLs as ids.

Pipeline sending data after spider_closed and opened again.

I am using scrapy-redis.
My spider is RedisSpider
(from docs)
The class scrapy_redis.spiders.RedisSpider enables a spider to read the urls from redis. The urls in the redis queue will be processed one after another, if the first request yields more requests, the spider will process those requests before fetching another url from redis.

If I am right, scrapy-elasticsearch sends data to Elasticsearch once the number of buffered items >= the ELASTICSEARCH_BUFFER_LENGTH setting:

if len(self.items_buffer) >= self.settings.get('ELASTICSEARCH_BUFFER_LENGTH', 500):
    self.send_items()
    self.items_buffer = []

RedisSpider waits when idle, so if we send 600 URLs to redis and our ELASTICSEARCH_BUFFER_LENGTH is 500, 100 URLs won't be saved, because the RedisSpider never closes.

So I overrode the spider_idle method:

Now the spider closes when it's idle. It works.

But using this code I keep running the spider in a loop that never ends: when it closes, it runs again. If there are URLs in the redis queue they are crawled, the spider is closed, data is sent to Elasticsearch and the spider restarts. It works, but now the loop is:

  1. Spider starts.
  2. Spider reads urls from redis queue.
  3. Spider parsing...
  4. Spider finished (last chunk of data sent to Elasticsearch).
  5. Spider starts again, data is being sent to Elasticsearch... and the loop continues from step 2.

Here's log of the loop:
https://gist.github.com/pythoncontrol/4e88f5de253ca406b24885af0b4673fd
