
scrapy-elasticsearch's Introduction

Description

Scrapy pipeline which allows you to store Scrapy items in Elasticsearch.

Install

pip install ScrapyElasticSearch

If you need support for ntlm:
pip install "ScrapyElasticSearch[extras]"

Usage (configure settings.py)

ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500
}

ELASTICSEARCH_SERVERS = ['localhost']
ELASTICSEARCH_INDEX = 'scrapy'
ELASTICSEARCH_INDEX_DATE_FORMAT = '%Y-%m'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'  # Custom unique key

# Can also accept a list of fields if you need a composite key
ELASTICSEARCH_UNIQ_KEY = ['url', 'id']

ELASTICSEARCH_SERVERS - list of hosts or string (single host). Host format: protocol://username:password@host:port.

Examples:
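(the hosts and credentials below are placeholder values)

ELASTICSEARCH_SERVERS = 'localhost'                                                        # single host as a string
ELASTICSEARCH_SERVERS = ['https://es1.example.com:9200', 'https://es2.example.com:9200']   # multiple nodes
ELASTICSEARCH_SERVERS = ['http://user:secret@es1.example.com:9200']                        # credentials in the URL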

Available parameters (in settings.py)

 ELASTICSEARCH_INDEX - Elasticsearch index name
 ELASTICSEARCH_INDEX_DATE_FORMAT - the format for the date suffix appended to the index name; see Python's datetime.strftime for the format codes. Default is no date suffix.
 ELASTICSEARCH_TYPE - Elasticsearch type
 ELASTICSEARCH_UNIQ_KEY - optional field, unique key as a string or a list of fields (must be fields declared in the item model, see items.py)
 ELASTICSEARCH_BUFFER_LENGTH - optional field, number of items to buffer before each bulk insertion into Elasticsearch. Default is 500.
 ELASTICSEARCH_AUTH - optional field, set to 'NTLM' to use NTLM authentication (an example follows the CA example below)
 ELASTICSEARCH_USERNAME - optional field, set to 'DOMAIN\username', only used with NTLM authentication
 ELASTICSEARCH_PASSWORD - optional field, set to your password, only used with NTLM authentication

 ELASTICSEARCH_CA - optional setting for when the Elasticsearch servers require custom CA or client certificate files.
 Example:
 ELASTICSEARCH_CA = {
      'CA_CERT': '/path/to/cacert.pem',
      'CLIENT_CERT': '/path/to/client_cert.pem',
      'CLIENT_KEY': '/path/to/client_key.pem'
}
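
 For NTLM authentication, a minimal illustrative configuration (the domain, username and password are placeholders; note the escaped backslash in the Python string):
 ELASTICSEARCH_AUTH = 'NTLM'
 ELASTICSEARCH_USERNAME = 'MYDOMAIN\\myuser'
 ELASTICSEARCH_PASSWORD = 'secret'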

Here is an example app (dirbot: https://github.com/jayzeng/dirbot) in case you are still unsure how to wire things up.

Dependencies

See requirements.txt

Changelog

  • 0.9: Accept custom CA cert to connect to es clusters
  • 0.8: Added support for NTLM authentication
  • 0.7.1: Added date format to the index name and a small bug fix
    • ELASTICSEARCH_BUFFER_LENGTH default was 9999; this has been changed to match the documentation.
  • 0.7: A number of backwards incompatibility changes are introduced:
    • Changed ELASTICSEARCH_SERVER to ELASTICSEARCH_SERVERS
    • ELASTICSEARCH_SERVERS accepts string or list
    • ELASTICSEARCH_PORT is removed, you can specify it in the url
    • ELASTICSEARCH_USERNAME and ELASTICSEARCH_PASSWORD are removed. You can use this format ELASTICSEARCH_SERVERS=['http://username:password@host:port']
    • Changed scrapy.log to logging as scrapy now uses the logging module
  • 0.6.1: Able to pull configs from spiders (in addition to reading from config file)
  • 0.6: Bug fix
  • 0.5: Ability to persist objects; option to specify logging level
  • 0.4: Remove debug
  • 0.3: Auth support
  • 0.2: Scrapy 0.18 support
  • 0.1: Initial release

Issues

If you find any bugs or have any questions, please report them on the issue tracker (https://github.com/knockrentals/scrapy-elasticsearch/issues).

Contributors

alukach, andskli, aniketmaithani, denizdogan, ignaciovazquez, jayzeng, jenkin, jsgervais, julien-duponchelle, lljrsr, mjm159, phrawzty, ppaci, sajattack, songzhiyong, tpeng

Licence

Copyright 2014 Michael Malocha

Expanded on the work by Julien Duponchelle

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

scrapy-elasticsearch's Issues

Time-based indices' name from scraped data

The parameter ELASTICSEARCH_INDEX_DATE_FORMAT sets the index suffix from the scraping timestamp using the specified format (e.g. -%Y%m%d). But I need to set it from a string or a datetime (in a given format) taken from the scraped item. Here is a simple solution with two more parameters (ELASTICSEARCH_INDEX_DATE_KEY and ELASTICSEARCH_INDEX_DATE_KEY_FORMAT): jenkin@e834082.
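Until something like that is merged, one workaround is a small custom pipeline that computes the index name from the item itself instead of using this package's index setting. A minimal sketch, assuming a 'published_at' field in '%Y-%m-%d' format and a 'scrapy-' index prefix (these names are illustrative, not part of scrapy-elasticsearch):

from datetime import datetime
from elasticsearch import Elasticsearch

class DateKeyedElasticsearchPipeline(object):
    def open_spider(self, spider):
        # Connect using the same server setting the stock pipeline reads.
        self.es = Elasticsearch(spider.settings.get('ELASTICSEARCH_SERVERS', ['localhost']))

    def process_item(self, item, spider):
        # Build the index suffix from the item's own date field.
        published = datetime.strptime(item['published_at'], '%Y-%m-%d')
        index_name = 'scrapy-%s' % published.strftime('%Y-%m')
        self.es.index(index=index_name, doc_type='items', body=dict(item))
        return item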

'set' object has no attribute 'iteritems'

Something has gone wrong with my scrapy elasticsearch pipeline. If I leave the pipeline as active in my settings, it returns an AttributeError (see attached). However, if I comment the pipeline out, the script runs without issue. Thoughts?

(screenshots attached: "set object error", "set object error settings")

Suggestion of setting '_index', '_source' and other parameters directly in parser

Hi,
I want to suggest changing how the pipeline operates so that the items to be indexed are created by the user at the parser level, rather than by the pipeline via the 'ELASTICSEARCH_INDEX' and 'ELASTICSEARCH_TYPE' parameters.
Advantages:
- The user can specify different Elasticsearch indices for different parsers
- The user can control the '_op_type' setting of the bulk method, for example to change from 'index' to 'update'

Cheers,
Julian

AttributeError: module 'types' has no attribute 'ListType'

After adding ElasticSearchPipeline to my ITEM_PIPELINES array I see this error:

Traceback (most recent call last):
  File "/home/spl/Code/python_env/myenv/lib/python3.5/site-packages/twisted/internet/defer.py", line 587, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/spl/Code/python_env/myenv/lib/python3.5/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 108, in process_item
    if isinstance(item, types.GeneratorType) or isinstance(item, types.ListType):
AttributeError: module 'types' has no attribute 'ListType'

All involved packages are installed in the most recent versions.
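types.ListType only exists in Python 2; under Python 3 the check needs to test against the built-in list. A minimal sketch of a version-independent test (the function name here is just for illustration; the package's own check lives inside process_item):

import types

def is_multi_item(item):
    # 'list' replaces the removed types.ListType; generators are unchanged.
    return isinstance(item, types.GeneratorType) or isinstance(item, list)

Upgrading to a release of the package with Python 3 support may also resolve this.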

BulkIndexError

BulkIndexError: (u'1 document(s) failed to index.', [{u'create': {u'status': 400, u'_type': u'jd_comment_test', u'_index': u'jd_comment-2018-05-26', u'error': {u'reason': u'Field [_id] is defined twice in [jd_comment_test]', u'type': u'illegal_argument_exception'}

How can I solve this problem?

Elasticsearch pipeline not enabled - Scrapy 1.3.3 / ES 5.2

Hi,

I’m trying to integrate Elasticsearch with Scrapy. I’ve followed the steps from https://github.com/knockrentals/scrapy-elasticsearch,
but it’s not loading the pipeline. I’m using Scrapy 1.3.3 with Elasticsearch 5.2.

Logging:
INFO: Enabled item pipelines: []

My settings.py is as follows:

ITEM_PIPELINES = {
'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500
}

ELASTICSEARCH_SERVERS = ['http://172.17.0.2:9200']
ELASTICSEARCH_INDEX = 'scrapy'
#ELASTICSEARCH_INDEX_DATE_FORMAT = '%Y-%m'
ELASTICSEARCH_INDEX_DATE_FORMAT = '%A %d %B %Y'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url' # Custom unique key

Am I missing something? Do you need to define the Pipeline in pipelines.py?
The “dirbot” example didn’t.

ImportError: No module named requests_ntlm

@jayzeng Thanks for your effort. I have done exactly what you instructed and installed ScrapyElasticSearch 0.8.3.
But now I get this error:
.....
File "C:\Miniconda2\lib\site-packages\scrapy\utils\misc.py", line 44, in load_
object
mod = import_module(module)
File "C:\Miniconda2\lib\importlib__init__.py", line 37, in import_module
import(name)
File "C:\Miniconda2\lib\site-packages\scrapyelasticsearch\scrapyelasticsearch.
py", line 21, in
from .transportNTLM import TransportNTLM
File "C:\Miniconda2\lib\site-packages\scrapyelasticsearch\transportNTLM.py", l
ine 7, in
from requests_ntlm import HttpNtlmAuth
ImportError: No module named requests_ntlm

Specify specific fields to index

I am also storing the raw HTML along with the items, but do not want to send that to the ES index. Can we specify which fields should be sent to ES for indexing?
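One approach that doesn't require changes to this package is an extra pipeline, ordered before ElasticSearchPipeline, that strips the fields you don't want indexed. A minimal sketch, assuming a hypothetical 'raw_html' field and a placeholder 'myproject.pipelines' module path:

class StripFieldsPipeline(object):
    # Fields to drop before the item reaches the Elasticsearch pipeline.
    EXCLUDED_FIELDS = ['raw_html']

    def process_item(self, item, spider):
        for field in self.EXCLUDED_FIELDS:
            item.pop(field, None)  # works for dicts and dict-like scrapy Items
        return item

ITEM_PIPELINES = {
    'myproject.pipelines.StripFieldsPipeline': 400,
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500,
}

Note that this removes the field for every later pipeline and exporter, so order ITEM_PIPELINES accordingly if something else still needs the raw HTML.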

ImportError: No module named 'transportNTLM'

I successfully installed it according to the document https://pypi.python.org/pypi/ScrapyElasticSearch. However, when I try

scrapy crawl myspider

in the Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)] on win32, i get this error:

File "E:\Python35\lib\site-packages\scrapyelasticsearch\scrapyelasticsearch.py", line 25, in <module> from transportNTLM import TransportNTLM ImportError: No module named 'transportNTLM'

I checked in the folder and the transportNTLM.py module is there.
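For what it's worth, the NTLM traceback earlier on this page (from 0.8.3) shows the package-relative form of this import, which is what Python 3 requires:

from .transportNTLM import TransportNTLM

So upgrading to a release that uses the relative import, or patching that line locally, should get past this error (assuming requests_ntlm itself is installed).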

Text field always gets ignore_above keyword

Hi,

Every time I save a text field to ES, the mapping has the following structure:

"text": {
                  "type": "text",
                  "fields": {
                     "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                     }
                  }
               }

Meaning I cannot search for any text that appears after 256 characters.

Is there any way of avoiding this? Thanks very much in advance!
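That structure comes from Elasticsearch's default dynamic mapping, not from this pipeline: the 'text' part of the field is still analyzed and searchable with full-text queries, and ignore_above only limits the auto-generated '.keyword' sub-field used for exact matches and aggregations. If you want different behaviour, one option is to create the index with an explicit mapping before crawling. A minimal sketch using elasticsearch-py, where the index name 'scrapy', the type 'items' and the field name are assumptions matching the settings above:

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
# Create the index up front so dynamic mapping never adds the keyword sub-field.
es.indices.create(index='scrapy', body={
    'mappings': {
        'items': {
            'properties': {
                'text': {'type': 'text'}
            }
        }
    }
})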

Dynamically set on index/type?

Configuration for Elasticsearch index and type is done statically in settings.py. Is there a recommended approach to setting this during runtime, perhaps based on the item that is being piped?
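Since 0.6.1 the pipeline can pull its configuration from the spider as well as from settings.py, so per-spider (though not per-item) index and type are possible with Scrapy's custom_settings, assuming the pipeline reads its settings through the crawler. A minimal sketch with placeholder names:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    # Overrides merged on top of settings.py for this spider only.
    custom_settings = {
        'ELASTICSEARCH_INDEX': 'products',
        'ELASTICSEARCH_TYPE': 'product',
    }

For truly per-item routing you would need a custom pipeline that derives the index from the item, as sketched under the time-based indices issue above.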

How to handle deleted documents?

Example: I'm crawling an API with JSON documents. From time to time some documents get removed from the source database and are missing in the API. How to handle this with scrapy-elasticsearch to keep es up to date?

From the source code, I can see that there's no _op_type parameter specified in the bulk call so it probably resorts to default 'index'.

add index mapping

Hi!
How could I define a mapping for each field?
I want all my index fields to be mapped as "not_analyzed" so I can get exact values in search results.

Thank you,

Unique key is tuple if using items

If I am using items for parsing scrapy responses, the unique key when retrieved from the item is a tuple, and process_unique_key() will raise an exception. This can easily be fixed by changing line 94 in scrapyelasticsearch.py

From
if isinstance(unique_key, list):
To
if isinstance(unique_key, (list,tuple)):

got an unexpected keyword argument 'headers'

Hello, I'm trying to insert data into the Bonsai.io ES cloud and getting this error:

File "/usr/local/lib/python3.5/dist-packages/elasticsearch/client/init.py", line 1155, in bulk
headers={'content-type': 'application/x-ndjson'})
TypeError: perform_request() got an unexpected keyword argument 'headers'

How can I solve it?

Thanks

Removal of _type requirement

We have recently upgraded to Elasticsearch 6.2.x, which does not require a type. Is there a way to remove the requirement for ELASTICSEARCH_TYPE in the code?

Add optional date suffix to index name

With Elasticsearch it's common practice to add a date suffix to an index. If your index name is "test", there should be a way to automatically create monthly indexes (example: test-2016-6, test-2016-7, etc...).

I am willing to submit a Pull Request to this repo with the changes I made to my local copy, just need permission.

Is it possible to update an item if an item with the same id exists?

Is it possible to update an item if an item with the same id already exists in Elasticsearch, instead of adding a new one? I mean:
{
itemId: 1,
color: ['red', 'blue']
}
{
itemId: 1,
color: ['green']
}

result:
{
itemId: 1,
color: ['red', 'blue', 'green']
}
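As noted in another issue below, the pipeline bulk-indexes with the default 'index' op type, so a document with the same _id is overwritten rather than merged. Merging lists like this needs an update with a script; a hedged sketch using elasticsearch-py directly (the index, type and field names are assumptions, and on ES 5.x the script key may be 'inline' instead of 'source'):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
# Append new colors to an existing document, or create it if missing.
es.update(index='scrapy', doc_type='items', id='1', body={
    'script': {
        'lang': 'painless',
        'source': 'ctx._source.color.addAll(params.new_colors)',
        'params': {'new_colors': ['green']},
    },
    'upsert': {'itemId': 1, 'color': ['green']},
})

Note this simple script appends without de-duplicating; avoiding repeated values needs a slightly longer script.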

Scrapy logging show UnicodeDecodeError

Hello,

How can I solve issues related to encoding/decoding? Below is the traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 159, in close_spider
    self.send_items()
  File "/usr/local/lib/python2.7/dist-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 146, in send_items
    helpers.bulk(self.es, self.items_buffer)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/actions.py", line 304, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/actions.py", line 216, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/actions.py", line 75, in _chunk_actions
    cur_size += len(data.encode("utf-8")) + 1
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 58: ordinal not in range(128)

Thanks,

Include S3 images field

I'm using the S3 feed export to download images during scraping. I'm able to download the images to my S3 bucket, but that happens after scrapy-elasticsearch has already completed. How can I include the image's S3 URL along with my item to be indexed?

Example:
item['thumbnail'] = 'https://s3-eu-west-1.amazonaws.com/image-url-response-from-s3

ELASTICSEARCH_UNIQ_KEY from multiple item fields

You can have an item without a single-field primary key, so this functionality is useless (or even dangerous!). Sure, you can compute and add a new, really unique field to the item, e.g. from a concatenation of fields, but then it will be indexed along with the others. Maybe ELASTICSEARCH_UNIQ_KEY could accept a list of field keys and concatenate their values (forced to strings) before computing the hash.

NTLM

I needed to do a:

pip install requests_ntlm

for this pipeline to work

How to insert data into an existing index ?

Hi there,

This plugin works great with the latest version of Scrapy (1.3) and Elasticsearch (5.1.1) on Ubuntu 16. Great work, Thanks.

There is a little problem. I have already setup an 'index' and 'mappings' in Elasticsearch. How do I configure this plugin to insert data in that existing index rather than creating a new one?

I used these settings (where the 'news' index and the 'allNews' type already exist). The following settings create a new index called "news-2017-01" and insert all the data into that index. I don't want that. I want this plugin to populate an already existing index. How do I do that?

ELASTICSEARCH_SERVERS = ['localhost']
ELASTICSEARCH_INDEX = 'news'
ELASTICSEARCH_INDEX_DATE_FORMAT = '%Y-%m'
ELASTICSEARCH_TYPE = 'allNews'
#ELASTICSEARCH_UNIQ_KEY = 'url' # Custom unique key

Please help.
Thanks
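The "-2017-01" suffix comes from ELASTICSEARCH_INDEX_DATE_FORMAT; with that setting removed the pipeline writes straight to the existing index, since the default is no date suffix. Something like:

ELASTICSEARCH_SERVERS = ['localhost']
ELASTICSEARCH_INDEX = 'news'
# no ELASTICSEARCH_INDEX_DATE_FORMAT, so no date suffix is appended
ELASTICSEARCH_TYPE = 'allNews'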

Add support for multiple ES nodes

In settings you can set ELASTICSEARCH_SERVER only as a simple string. Pyes supports multiple hosts in ES initialization, and this string is passed wrapped in a single-element list (see L50 in scrapyelasticsearch.py).

It could be useful to set ELASTICSEARCH_SERVER also as a list of hosts, maybe checking whether the setting is actually a string or a list, e.g. via isinstance().

Bulk indexing instead of single item indexing

Working as a pipeline, every item is indexed separately, with many requests to ES (one per item). In addition, in some cases you want to break the pipeline concept and apply a global transformation to items before indexing (e.g. by overloading the open_spider and close_spider methods in a pipeline class).

Using the ES bulk API you could temporarily add items to an item buffer (with a length controlled by a setting) and then index them in batches instead of one request per item.

[SUGGESTION] Tackling MemoryError raised on bulk inserts

Good work on the extension, and I appreciate the help being provided to the community. Though I haven't used your extension in production, the extension I wrote is very similar to the code you have.

I wanted to bring to your attention an issue I faced personally which may improve your extension. With long-running scrapers (we have had scrapers run for 20 hrs sometimes) it is possible that your machine will run out of memory if all the items are appended to the items_buffer like so. My scrapers have failed to insert items after raising a MemoryError. The workaround I use is to set a max_insert_limit_counter in the extension class and bulk insert items into ES once the max limit is hit. Somebody who uses this extension will probably run into this issue in the future.

If you would like me to create a PR for this, let me know.

Python 3 issue with hashlib.sha1() for unique ID?

I am currently migrating from Python 2 to 3, so I'm not sure if this is a valid issue.

When I configured my ELASTICSEARCH_UNIQ_KEY value, I ran into a problem - if my unique ID is str, hashlib.sha1() complains:

TypeError: Unicode-objects must be encoded before hashing

If I .encode('utf-8') the ID before putting it in the field, line 73 in scrapyelasticsearch.py complains 'unique key must be str'

To work around it, I have to put the ID in a list!

What's the purpose of if isinstance(unique_key, list): in the def get_unique_key(self, unique_key) method?
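Python 3's hashlib.sha1() only accepts bytes, so a str key has to be encoded first; the list branch exists because ELASTICSEARCH_UNIQ_KEY may be a list of fields forming a composite key. A minimal sketch of the kind of normalisation involved (exactly how the package joins composite keys is an assumption here):

import hashlib

def hash_unique_key(unique_key):
    # Composite keys: join the parts into one string first.
    if isinstance(unique_key, (list, tuple)):
        unique_key = '-'.join(str(part) for part in unique_key)
    # Python 3: sha1() needs bytes, so encode str values.
    if isinstance(unique_key, str):
        unique_key = unique_key.encode('utf-8')
    return hashlib.sha1(unique_key).hexdigest()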

Elasticsearch not receiving data from scrapy

I had set up Scrapy on my local machine with a CrawlSpider to index a local static HTML site. So far so good, I get a valid JSON file as output.
Next I installed ScrapyElasticSearch, configured settings.py with the correct ITEM_PIPELINES and ran scrapy crawl on my site.
If I look at the logs, I get this:

2017-08-01 13:25:56 [root] DEBUG: Generated unique key bbd9eba5e56d510757eb42eed3b130520b7b1958
2017-08-01 13:25:56 [root] DEBUG: Item sent to Elastic Search scrapy

But when I look at my Elasticsearch server, no data has been entered. Even worse, I can just shutdown my Elasticsearch engine and the log entry will still say the same. So no error message is thrown.
I've also tested this in a clean vagrant machine with virtualenv enabled, but the problem is the same. I tried logging network traffic with tcpdump, not a single byte is passed. I have no clue what I did wrong, other than that something is broken.

Below is my pip list:

argparse (1.2.1)
asn1crypto (0.22.0)
attrs (17.2.0)
Automat (0.6.0)
cffi (1.10.0)
constantly (15.1.0)
cryptography (2.0.2)
cssselect (1.0.1)
elasticsearch (5.4.0)
enum34 (1.1.6)
hyperlink (17.3.0)
idna (2.5)
incremental (17.5.0)
ipaddress (1.0.18)
lxml (3.8.0)
parsel (1.2.0)
pip (1.5.6)
pyasn1 (0.3.1)
pyasn1-modules (0.0.10)
pycparser (2.18)
PyDispatcher (2.0.5)
pyOpenSSL (17.2.0)
queuelib (1.4.2)
Scrapy (1.4.0)
ScrapyElasticSearch (0.8.9)
service-identity (17.0.0)
setuptools (5.5.1)
six (1.10.0)
Twisted (17.5.0)
urllib3 (1.22)
w3lib (1.17.0)
wsgiref (0.1.2)
zope.interface (4.4.2)

My relevant settings.py entries:

ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 100
}

ELASTICSEARCH_SERVERS = ['http://127.0.0.1:9200']
ELASTICSEARCH_INDEX = 'scrapy'
ELASTICSEARCH_INDEX_DATE_FORMAT = '%Y-%m'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'

Thank you for your help

ELASTICSEARCH_BUFFER_LENGTH

I use scrapy-redis, and my spider waits for input from the redis queue.
If I send fewer URLs than the buffer length, they won't ever be pushed into Elasticsearch.

Do you have any workarounds?
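Until the buffer is also flushed when the spider goes idle, one blunt workaround is to shrink the documented buffer setting so items are sent (almost) immediately, at the cost of more requests to Elasticsearch:

ELASTICSEARCH_BUFFER_LENGTH = 1  # flush after every item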

Lots of items missing in kibana when using elasticsearch pipeline, but available in csv export and another relational pipeline

I would like to know how to debug my situation: I have a PostgreSQL pipeline that's working flawlessly, adding 2k items to my relational database when I run Scrapy. I've installed scrapy-elasticsearch as well to be able to use Elasticsearch alongside PostgreSQL, but after scraping, when I get into Kibana, I have... 36 items. My index is the day the item was scraped, and even selecting "years ago" in the Kibana interface I only get 36 hits.

How and where do I debug to check where are things going wrong?

Content-Type required, Elasticsearch 6.x

Hi,

I'm testing this plugin with the new Elasticsearch 6.x version.
The Content-Type (json) header is now required. I get the following error:

[elasticsearch] DEBUG: < {"error":"Content-Type header [] is not supported","status":406}

Is there a way to set the content-type to json ?

Thanks !

Missing header information for ElasticSearch 6.2

Thank you all for putting together this great tool. I was thrilled to find this.

I am currently getting an error as follows:

{"error":"Content-Type header [] is not supported","status":406}

According to elasticsearch-dump/elasticsearch-dump#350, additional headers need to be passed for Elasticsearch 6.x, as follows:

-headers='{"Content-Type": "application/json"}

Could this be added as a new configuration option?

TypeError: sha1() argument 1 must be string or buffer, not list

Hi,

I have the following issue when running the spider below, before the item is added to ES.

The ES Key is set as "link".

Any help would be greatly appreciated.

import scrapy
import uuid

from compass.items import CompassItem

class DarkReadingSpider(scrapy.Spider):
    name = "darkreading"
    allowed_domains = ["darkreading.com"]
    start_urls = (
        'http://www.darkreading.com/rss_simple.asp',
    )

    def parse(self, response):
        for sel in response.xpath('//item'):
                item = CompassItem()
                item['id'] = uuid.uuid4()
                item['title'] = sel.xpath('title/text()').extract()
                item['link'] = sel.xpath('link/text()').extract()
                item['desc'] = sel.xpath('description/text()').extract()
                print item
                yield item

Output/Error:

{'desc': [u'Two-thirds of IT security professionals say that network security has become more difficult over the last two years with growing complexity in managing heterogeneous network environments.'],
'id': UUID('0112e36e-50ce-4660-9072-da2a5fec09e6'),
'link': [u'http://www.darkreading.com/cloud/survey-shows-cloud-infrastructure-security-a-major-challenge-/d/d-id/1324872?_mc=RSS_DR_EDT'],
'title': [u'Survey Shows Cloud Infrastructure Security A Major Challenge ']}
2016-04-01 15:15:34 [scrapy] ERROR: Error processing {'desc': [u'Two-thirds of IT security professionals say that network security has become more difficult over the last two years with growing complexity in managing heterogeneous network environments.'],
'id': UUID('0112e36e-50ce-4660-9072-da2a5fec09e6'),
'link': [u'http://www.darkreading.com/cloud/survey-shows-cloud-infrastructure-security-a-major-challenge-/d/d-id/1324872?_mc=RSS_DR_EDT'],
'title': [u'Survey Shows Cloud Infrastructure Security A Major Challenge ']}
Traceback (most recent call last):
  File "/usr/local/lib64/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 70, in process_item
    self.index_item(item)
  File "/usr/local/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 52, in index_item
    local_id = hashlib.sha1(item[uniq_key]).hexdigest()
TypeError: sha1() argument 1 must be string or buffer, not list
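The unique key ('link') is hashed with sha1(), and xpath(...).extract() returns a list, hence the error. Extracting a single string for each field avoids it. A minimal sketch of the parse method with that change (extract_first() is available in Scrapy 1.x, and the UUID is stringified for the same reason):

    def parse(self, response):
        for sel in response.xpath('//item'):
            item = CompassItem()
            item['id'] = str(uuid.uuid4())
            item['title'] = sel.xpath('title/text()').extract_first()
            # The ES unique key must be a plain string, not a list.
            item['link'] = sel.xpath('link/text()').extract_first()
            item['desc'] = sel.xpath('description/text()').extract_first()
            yield item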

What does 'ELASTICSEARCH_UNIQ_KEY' do?

Is it for setting the _id value? I added some documents and it didn't seem like that. If so, is there any way to set the id value? I want to use the scraped URLs as ids.

Pipeline sending data after spider_closed and opened again.

I am using scrapy-redis.
My spider is RedisSpider
(from docs)
The class scrapy_redis.spiders.RedisSpider enables a spider to read the urls from redis. The urls in the redis queue will be processed one after another, if the first request yields more requests, the spider will process those requests before fetching another url from redis.

If I am right, scrapy-elasticsearch sends data to Elasticsearch once the number of buffered items >= the ELASTICSEARCH_BUFFER_LENGTH setting:

if len(self.items_buffer) >= self.settings.get('ELASTICSEARCH_BUFFER_LENGTH', 500):
    self.send_items()
    self.items_buffer = []

RedisSpider waits when idle, so if we send 600 URLs to redis and our ELASTICSEARCH_BUFFER_LENGTH is 500, 100 URLs won't be saved, because the RedisSpider never closes.

So I overrode the spider_idle method:

Now the spider closes when it's idle. It works.

But using this code I keep running the spider in a loop that never ends: when it closes, it runs again. If there are URLs in the redis queue they are crawled, the spider is closed, data is sent to Elasticsearch and the spider restarts. It works, but now the loop is:

  1. Spider starts.
  2. Spider reads urls from redis queue.
  3. Spider parsing...
  4. Spider finished (last chunk of data sent to Elasticsearch).
  5. Spider starts again, data is being sent to Elasticsearch... and the loop continues from step 2.

Here's log of the loop:
https://gist.github.com/pythoncontrol/4e88f5de253ca406b24885af0b4673fd
