Comments (16)
Any updates on speeding this up?
I regularly have to do temporary log analysis by ingesting logs from S3, and the longer ingestion takes, the more it costs and the angrier people get waiting for all their data to arrive so they can analyze it.
from logstash-input-s3.
I send my ELB logs to an S3 bucket. ELB logs for different services go into different directories of my main log bucket. When I point the S3 input at the bucket, I don't get any logs.
Here is my s3 input conf file:
input {
  s3 {
    bucket => "production-logs"
    region => "us-east-1"
    prefix => "elb/"
    type => "elb"
    sincedb_path => "log_sincedb"
  }
}
But if I set a full file path as the prefix, then I can view the logs in Kibana (example: elb/production-XXXX/AWSLogs/XXXXXX/elasticloadbalancing/us-east-1/2016/02/24/). What I want is to ingest logs from every subdirectory of my bucket.
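For what it's worth, S3 has no real directories: a prefix is a plain string match against the start of the full object key, so a short prefix like "elb/" already matches every nested "subdirectory". A minimal Ruby sketch with made-up keys:

```ruby
# Hypothetical keys, illustrating how S3 prefix filtering behaves:
# a prefix matches the start of the whole key, "directories" and all.
keys = [
  "elb/production-XXXX/AWSLogs/111/elasticloadbalancing/us-east-1/2016/02/24/a.log",
  "elb/staging-YYYY/AWSLogs/222/elasticloadbalancing/us-east-1/2016/02/24/b.log",
  "cloudfront/2016/02/24/c.log",
]

matches = keys.select { |k| k.start_with?("elb/") }
puts matches.length   # 2 -- both nested "subdirectories" match
```

So if a bare "elb/" prefix returns nothing while a deeper prefix works, the problem is more likely in how the plugin walks the listing than in the prefix semantics themselves.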
It looks like the plugin loops through every object in the bucket before processing them. As you add objects, that list grows and takes longer to loop through. I need to do a bit more testing on this theory, but that's what I seemed to see while processing a ton of files (> 80K). I can't quite tell if it queues them all up, or if it processes them while looping through them. Amazon's API limits you to 1K objects per call, but some of the libraries abstract this and add paging, such that a loop will go through everything.
It would be nice to have the option of limiting how many objects it processes at a time.
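A sketch of that limit idea in plain Ruby (the helper and cap value are hypothetical, not plugin code; the real ListObjects call pages at 1,000 keys):

```ruby
PAGE_SIZE   = 1000   # ListObjects returns at most 1,000 keys per call
MAX_PER_RUN = 2500   # hypothetical cap on objects handled per run

# Walk the paged listing but stop once the cap is reached, so a run
# stays bounded even when the bucket holds tens of thousands of objects.
def take_limited(keys, page_size, max)
  taken = []
  keys.each_slice(page_size) do |page|   # one "API call" per page
    page.each do |key|
      return taken if taken.size >= max  # stop early: bounded run
      taken << key
    end
  end
  taken
end

all_keys = (1..5_000).map { |i| "logs/object-#{i}" }
puts take_limited(all_keys, PAGE_SIZE, MAX_PER_RUN).size   # 2500
```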
https://github.com/logstash-plugins/logstash-input-s3/blob/master/lib/logstash/inputs/s3.rb#L104
list_new_files runs through the listing looking for keys that match the prefix and don't match the excludes, storing state in the sincedb. If you move objects to another bucket or prefix after they have been processed, run times should improve, since the list to run through and check would be much smaller. Not a solution, but a workaround at least.
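In sketch form (hypothetical data, not the plugin's actual code): every listed key still has to be checked against the sincedb timestamp and excludes, which is why shrinking the listing helps.

```ruby
require "time"

sincedb_time = Time.parse("2016-02-24 00:00:00 UTC")  # last processed mtime
exclude      = /\.bak\z/                              # hypothetical exclude

objects = [
  { key: "elb/a.log",     last_modified: Time.parse("2016-02-23 10:00:00 UTC") },
  { key: "elb/b.log",     last_modified: Time.parse("2016-02-24 10:00:00 UTC") },
  { key: "elb/c.log.bak", last_modified: Time.parse("2016-02-24 11:00:00 UTC") },
]

# Only keys newer than the sincedb time and not excluded are candidates,
# but note that the select still visits every object in the listing.
new_files = objects.select do |o|
  o[:last_modified] > sincedb_time && o[:key] !~ exclude
end
puts new_files.map { |o| o[:key] }.inspect   # ["elb/b.log"]
```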
The root problem is that in ruby-aws-sdk, if you iterate through a bucket, checking thing.last_modified does a round trip to the AWS API. The ListObjects API call does return the last-modified date for every object, but apparently the SDK forgets this information and re-requests it every time. This means the S3 input is making an API call for every object in the bucket (matching the prefix), which is insanely slow.
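A toy model of the cost difference (fake classes, not the real SDK): the listing already carries last_modified, so re-fetching it per object multiplies the call count by the bucket size.

```ruby
# Fake API that counts calls, to show why lazy last_modified is slow.
class FakeApi
  attr_reader :calls

  def initialize
    @calls = 0
  end

  # One call returns every key WITH its last_modified timestamp.
  def list_objects
    @calls += 1
    [{ key: "a", last_modified: 1 }, { key: "b", last_modified: 2 }]
  end

  # A separate call per object, as if the SDK re-requested metadata.
  def head_object(_key)
    @calls += 1
    { last_modified: 0 }
  end
end

slow = FakeApi.new
slow.list_objects.each { |o| slow.head_object(o[:key]) }  # re-fetch pattern
puts slow.calls   # 3: one list plus one head per object

fast = FakeApi.new
fast.list_objects.each { |o| o[:last_modified] }          # reuse listing data
puts fast.calls   # 1: just the list
```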
@lexelby I didn't know aws-sdk was doing a round trip when requesting the last_modified information. I'll check how I can improve that part and also boost the performance of this method.
Concerning proxy support: it's an easy fix to add the option to our base AWS mixin, https://github.com/logstash-plugins/logstash-mixin-aws. I've looked quickly at how the aws-sdk uses the net/http class; if we don't specify the proxy as an option, it creates a net/http object with http_proxy set to nil, which I believe makes the library skip the http_proxy environment variable.
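That matches how Ruby's own net/http behaves: the proxy is read from the environment by default, but an explicit nil proxy address disables it. A quick stdlib-only check (the proxy host is made up; nothing actually connects):

```ruby
require "net/http"

ENV.delete("no_proxy")
ENV.delete("NO_PROXY")
ENV["http_proxy"] = "http://proxy.internal:3128"   # made-up proxy

# The default third argument is :ENV, so http_proxy is honoured.
from_env = Net::HTTP.new("example.com", 80)
puts from_env.proxy?

# An explicit nil proxy address opts out of the environment entirely.
explicit_nil = Net::HTTP.new("example.com", 80, nil)
puts explicit_nil.proxy?
```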
Oh, that mixin looks perfect. I see that the SQS input uses it, for example.
Here's the upstream bug, which they claim is fixed in a more recent version than logstash ships with: aws/aws-sdk-ruby#734
I believe this is related: aws/aws-sdk-ruby#588
So since this uses aws-sdk < 2, I think we're out of luck until it's upgraded. I'll see about it if I have some time, but there's also an open ticket to get the mixin updated.
I found a workaround, for my use-case at least. A couple, really. First, there's a pull request around somewhere for a fog-based s3 input, called s3fog. Like its author, I wanted to use the s3 input to pull cloudtrail logs into my ELK stack. I ended up using this: https://bitbucket.org/atlassianlabs/cloudtrailimporter. It's designed to skip logstash, which I think is kind of limiting, so I hacked on it: http://www.github.com/lexelby/cloudtrail-logstash/. Works quite nicely. Set up the SNS/SQS stuff as per this: https://github.com/AppliedTrust/traildash. I dumped traildash because I couldn't figure out how to build the darned thing.
Well, the switchover to v2 of the SDK was quick, but I can't seem to install the updated plugin locally for testing. :( This didn't help either: elastic/logstash#2779
If anyone else wants to give testing a shot, check out my fork over here: https://github.com/DanielRedOak/logstash-input-s3. Spec tests pass, but I haven't gotten to updating the integration tests.
PR submitted so this can be closed if/when merged: #25
@nowshad-amin Did you find a workaround for this issue?
Using the version with patch from @DanielRedOak worked like a charm.
The corresponding PR for this issue was merged, and I've updated to Logstash 5.0 and s3-input 3.1.1, but I'm still seeing slower-than-expected processing times for S3 access logs. This could perhaps be because Logstash isn't fully utilizing the available CPU (it hovers around 10-20%). Take this with a pinch of salt, as I'm running everything on localhost as an ELK stack orchestrated with docker-compose, but I can see S3 documents coming into Elasticsearch slowly but surely (by watching stdout output and refreshing a catch-all query in Kibana and observing hits). In one example, docker stats shows:
NAME                         CPU %    MEM USAGE / LIMIT      MEM %    NET I/O               BLOCK I/O             PIDS
lookbackelk_logstash_1       21.63%   508 MiB / 3.856 GiB    12.86%   36.14 MB / 54.56 MB   81.92 kB / 47.33 MB   65
lookbackelk_elasticsearch_1  3.91%    630.1 MiB / 3.856 GiB  15.96%   82.77 MB / 61.09 MB   954.4 kB / 230.1 MB   138
lookbackelk_kibana_1         0.60%    255.3 MiB / 3.856 GiB  6.47%    49.33 MB / 11.35 MB   1.044 MB / 0 B        10
and my Mac's CPU & network utilization are both pretty low. Any ideas?
I tried upping the pipeline workers & batch size, but didn't notice a huge increase in utilization. Probably just rookie mistakes combined with input size and runtime environment.
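For anyone tuning the same knobs, these are the Logstash 5.x settings involved (settable in logstash.yml, or via the -w and -b flags). One caveat: an input plugin instance runs in its own single thread, so extra pipeline workers speed up the filter and output stages but not the S3 listing and download step itself, which could explain the low CPU use. Values below are illustrative only:

```yaml
# logstash.yml -- illustrative values, not recommendations
pipeline.workers: 4        # threads running the filter + output stages
pipeline.batch.size: 250   # events each worker collects per batch
pipeline.batch.delay: 5    # ms to wait for a batch to fill before flushing
```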
+1