
logstash-input-s3's Introduction

Logstash Plugin

Travis Build Status

This is a plugin for Logstash.

It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want in whatever way.

Required S3 Permissions

This plugin reads from your S3 bucket and requires the following permissions in the AWS IAM policy being used:

  • s3:ListBucket to check if the S3 bucket exists and list objects in it.
  • s3:GetObject to check object metadata and download objects from S3 buckets.

You might also need s3:DeleteObject when the S3 input is configured to delete objects after reading them, and s3:CreateBucket to create the backup bucket if it does not already exist. In addition, when backup_to_bucket is used, the s3:PutObject action is also required.

For buckets that have versioning enabled, you might need to add additional permissions.
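For reference, a minimal read-only policy sketch (the bucket name is a placeholder; extend the action list as described above if you use delete, backup_to_bucket, or bucket creation):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LogstashS3InputList",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-log-bucket"]
    },
    {
      "Sid": "LogstashS3InputGet",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::my-log-bucket/*"]
    }
  ]
}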

More information about S3 permissions can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/using-with-s3-actions.html

Documentation

Logstash provides infrastructure to automatically generate documentation for this plugin. We use the asciidoc format to write documentation, so any comments in the source code will first be converted into asciidoc and then into html. All plugin documentation is placed in one central location.

Need Help?

Need help? Try #logstash on freenode IRC or the https://discuss.elastic.co/c/logstash discussion forum.

Developing

1. Plugin Development and Testing

Code

  • To get started, you'll need JRuby with the Bundler gem installed.

  • Create a new plugin or clone an existing one from the GitHub logstash-plugins organization. We also provide example plugins.

  • Install dependencies

bundle install

Test

  • Update your dependencies
bundle install
  • Run tests
bundle exec rspec

2. Running your unpublished Plugin in Logstash

2.1 Run in a local Logstash clone

  • Edit Logstash Gemfile and add the local plugin path, for example:
gem "logstash-filter-awesome", :path => "/your/local/logstash-filter-awesome"
  • Install plugin
# Logstash 2.3 and higher
bin/logstash-plugin install --no-verify

# Prior to Logstash 2.3
bin/plugin install --no-verify
  • Run Logstash with your plugin
bin/logstash -e 'filter {awesome {}}'

At this point any modifications to the plugin code will be applied to this local Logstash setup. After modifying the plugin, simply rerun Logstash.

2.2 Run in an installed Logstash

You can use the same method as in 2.1 to run your plugin in an installed Logstash by editing its Gemfile and pointing the :path to your local plugin development directory, or you can build the gem and install it using:

  • Build your plugin gem
gem build logstash-filter-awesome.gemspec
  • Install the plugin from the Logstash home
# Logstash 2.3 and higher
bin/logstash-plugin install --no-verify

# Prior to Logstash 2.3
bin/plugin install --no-verify
  • Start Logstash and proceed to test the plugin

Contributing

All contributions are welcome: ideas, patches, documentation, bug reports, complaints, and even something you drew up on a napkin.

Programming is not a required skill. Whatever you've seen about open source and maintainers or community members saying "send patches or die" - you will not see that here.

It is more important to the community that you are able to contribute.

For more information about contributing, see the CONTRIBUTING file.

logstash-input-s3's People

Contributors

alex-stiff, danielredoak, divdevar, electrical, jakelandis, jordansissel, jsvd, kaisecheng, karenzone, kares, kesor, kivagant, magical-travvo, mspiegle, nephel, ph, ppf2, robbavey, rwaweber, shuwada, stevebanik-ndsc, suyograo, talevy, tedder, todd534, torrancew, wiibaa, yaauie, ycombinator, yongkyun


logstash-input-s3's Issues

Error: No Such Key

Running Logstash 1.5.4 with logstash-input-s3-1.0.0

{:timestamp=>"2015-10-07T06:40:20.333000-0400", :message=>"A plugin had an unrecoverable error. Will restart this plugin.\n  Plugin: <LogStash::Inputs::S3 bucket=>\"my.cloudtrail.bucket1\", prefix=>\"AWSLogs\", access_key_id=>\"MYKEY\", secret_access_key=>\"FOOBAR\", delete=>false, interval=>900, region=>\"us-east-1\", sincedb_path=>\"/var/log/logstash/my.cloudtrail.bucket1.sincedb\", type=>\"cloudtrail\", codec=><LogStash::Codecs::CloudTrail spool_size=>50>, debug=>false, use_ssl=>true, temporary_directory=>\"/var/lib/logstash/logstash\">\n  Error: No Such Key", :level=>:error}
{:timestamp=>"2015-10-07T06:40:51.024000-0400", :message=>"A plugin had an unrecoverable error. Will restart this plugin.\n  Plugin: <LogStash::Inputs::S3 bucket=>\"my.cloudtrail.bucket1\", prefix=>\"AWSLogs\", access_key_id=>\"MYKEY\", secret_access_key=>\"FOOBAR\", delete=>false, interval=>900, region=>\"us-east-1\", sincedb_path=>\"/var/log/logstash/my.cloudtrail.bucket1.sincedb\", type=>\"cloudtrail\", codec=><LogStash::Codecs::CloudTrail spool_size=>50>, debug=>false, use_ssl=>true, temporary_directory=>\"/var/lib/logstash/logstash\">\n  Error: No Such Key", :level=>:error}
{:timestamp=>"2015-10-07T06:41:23.678000-0400", :message=>"A plugin had an unrecoverable error. Will restart this plugin.\n  Plugin: <LogStash::Inputs::S3 bucket=>\"my.cloudtrail.bucket1\", prefix=>\"AWSLogs\", access_key_id=>\"MYKEY\", secret_access_key=>\"FOOBAR\", delete=>false, interval=>900, region=>\"us-east-1\", sincedb_path=>\"/var/log/logstash/my.cloudtrail.bucket1.sincedb\", type=>\"cloudtrail\", codec=><LogStash::Codecs::CloudTrail spool_size=>50>, debug=>false, use_ssl=>true, temporary_directory=>\"/var/lib/logstash/logstash\">\n  Error: No Such Key", :level=>:error}

@ph maybe this can use your monkey-patch fix from the SQS input to restart?
It seems to work for periods of time, but then the threads die.
What also seems odd is that I have this running for a few accounts, and once I get this error I get no more "cloudtrail" events, so it looks like all of the S3 input threads die.

Plugin fails if it tries to ingest a binary file

Here is the log (the source of the problem was some PNG file in my bucket):

2015-04-21T08:03:18.310519241Z {:timestamp=>"2015-04-21T08:03:18.310000+0000", 
:message=>"A plugin had an unrecoverable error. Will restart this plugin.
Plugin: <LogStash::Inputs::S3>[...]
Error: invalid byte sequence in UTF-8", :level=>:error}

Maybe it should be wrapped in a begin/rescue block so the whole plugin won't crash?

Error: AWS::S3::Errors::Forbidden

I get the following Forbidden error with the input configured for AWS S3:

Plugin: <LogStash::Inputs::S3 add_field=>{"kitch_blog"=>"loadbalancer"}, bucket=>"com.xxxxx", access_key_id=>"xxxxxx", secret_access_key=>"xxxxx", prefix=>"AWSLogs/766254664790/elasticloadbalancing/us-east-1/", type=>"apache-access", sincedb_path=>"/etc/logstash/sincedb/", region=>"us-east-1", temporary_directory=>"/tmp/logstash">
  Error: AWS::S3::Errors::Forbidden {:level=>:error}

I've uploaded the debug mode output here:
https://gist.githubusercontent.com/maaand/92e0d8dcc24210eeb191/raw/6e88ab1098a2431b93839d51a13d6c72affe286e/gistfile1.txt

Missing S3 Asia Pacific (Seoul) Region "ap-northeast-2"

Region configuration for ap-northeast-2 returns an error:

$ /opt/logstash/bin/logstash -f logstash-s3-test.conf --configtest
Invalid setting for s3 input plugin:

input {
s3 {
# This setting must be a ["us-east-1", "us-west-1", "us-west-2", "eu-central-1", "eu-west-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "sa-east-1", "us-gov-west-1", "cn-north-1"]
# Expected one of ["us-east-1", "us-west-1", "us-west-2", "eu-central-1", "eu-west-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "sa-east-1", "us-gov-west-1", "cn-north-1"], got ["ap-northeast-2"]
region => "ap-northeast-2"
...
}
} {:level=>:error}

$ /opt/logstash/bin/logstash --version
logstash 2.2.2
$ /opt/logstash/bin/plugin list --verbose | grep input-s3
logstash-input-s3 (2.0.4)
$

Error: No such file or directory - /tmp/logstash

It doesn't seem to work at all; I keep getting this error (Logstash version 1.5.0-rc3).
With Logstash 1.4.2 it works, but I get errors about UTF-8.
Running with this configuration:
input {
s3 {
bucket => "bucketname"
region => "region"
access_key_id => "key"
secret_access_key => "secret"
type => "elb"
prefix => "elb/www/AWSLogs/1234567890/elasticloadbalancing/region/2015/04/"
interval => 60
}
}
Any ideas?
Thanks.

Increase verbosity?

I have a feeling I am running into the issue where it just takes a long time to enumerate and start copying items from huge buckets, but I don't know whether or not it is processing.

I have tested my config and Logstash says it is OK, but when I start Logstash there is no output to tell whether or not the S3 plugin is picking up and processing logs, other than the initial message saying it was loaded. Is there something I can do to help troubleshoot this?

sincedb default should not use ENV["HOME"]

Looks like by default sincedb will try to use $HOME:

https://github.com/logstash-plugins/logstash-input-s3/blob/master/lib/logstash/inputs/s3.rb#L274

This is problematic, because $HOME may not be set. Per the POSIX standard, this variable is set by the login program:

HOME
The system shall initialize this variable at the time of login to be a pathname
of the user's home directory. See <pwd.h>.
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html

In my case, this failed when I ran logstash as an upstart job with the logstash service account.

This leads to the following:

  Error: can't convert nil into String
  Exception: TypeError
  Stack: org/jruby/RubyFile.java:2023:in `join'
org/jruby/RubyFile.java:861:in `join'
/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-3.1.1/lib/logstash/inputs/s3.rb:276:in `sincedb_file'
/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-3.1.1/lib/logstash/inputs/s3.rb:263:in `sincedb'
/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-3.1.1/lib/logstash/inputs/s3.rb:102:in `list_new_files'

What should be used instead? Perhaps, $LOGSTASH_HOME?
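A minimal sketch of one possible fallback order, assuming the explicit sincedb_path setting wins, then $HOME, then the plugin's temporary_directory (the helper name sincedb_base_dir is made up; this is not the plugin's actual code):

  # Hypothetical fallback for where the sincedb file should live.
  def sincedb_base_dir
    return File.dirname(@sincedb_path) if @sincedb_path          # explicit setting wins
    return ENV["HOME"] if ENV["HOME"] && !ENV["HOME"].empty?     # login environments
    @temporary_directory                                         # always set by the plugin config
  end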

Unusual number of HEAD requests being made by S3 input plugin

We have S3 access logs being collected in a bucket. We are using S3 input plugin to index these files into ELK.

After a couple of months of usage, we noticed an unusual number of requests made to S3 (~1 billion/month), which costs about $440. This is only the charge for the number of requests, which is negligible for most use cases, and normally no one even bothers about this cost.

When I looked at the billing reports, there were around 950 million HEAD requests made to the bucket which holds these logs.
The S3 input plugin must be making all these requests (file watching?).

I am not sure if there is any need for some kind of optimization on the plugin side.
I think the logs that people store in S3 don't change over time (my assumption), so if a file is indexed already, there is no need to keep watching it.

From a user perspective, the options I can think of to avoid these requests are:

  1. Move the files to a different location after the indexing is done
  2. Download the files to a local drive using a cron job and use the file input plugin to index to ES
  3. Use daily prefixes so that the plugin watches only those files; log files are named with timestamps (see the config sketch after this list)
  4. Change the default interval to something higher if some delay is acceptable; S3 access logs are generated hourly, so there is an hour of delay anyway
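For option 3, a config sketch with a daily prefix (bucket name and key layout are placeholders; the prefix value would need to be rotated, e.g. by templating the config daily):

input {
  s3 {
    bucket   => "my-access-logs"
    prefix   => "access-logs/2016/03/15/"   # only list keys under today's prefix
    region   => "us-east-1"
    interval => 3600                        # option 4: poll hourly instead of every 60 seconds
  }
}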

Any opinions and suggestions are welcome.

Thanks

S3 input can take a long time to start and a long time to stop

When you have a bucket with a really large quantity of files, it can take a while to start because of all the API calls the code has to do. #25 optimized the number of calls to a reasonable amount by using v2 of the API, but this is still problematic.

The plugin can also take a really long time to stop; the current architecture of the plugin is single threaded. This means the following: the listing of remote files, the downloading, the uncompressing, and the actual processing are all done in a single thread.

The stop doesn't correctly interrupt this chain.

We need to decouple these parts into different stages to better control the flow of execution of this plugin.

OOM Error with logstash s3 input

I am trying to index 4 months of logs from multiple buckets. I have logstash and elasticsearch running on a single m3.xlarge(16 GB RAM) instance.
Logstash version: 1.5.0-rc3
Elasticsearch version: 1.5.2
ArchLinux

After running for some time, Logstash or Elasticsearch would get killed by the OOM killer. If I restart Logstash, the same thing happens within a few minutes.
If I reboot the machine, the problem goes away. Memory usage is constant around 70% (50% is for the ES heap).

I have observed that /tmp/logstash fills up with a lot of files. I think the S3 input isn't deleting the files after it's done indexing them, and it also seems to download files from S3 as fast as it can, irrespective of how fast it is indexing. Rebooting deletes these files; I guess that's why I was able to get Logstash back up and running after rebooting. I am not sure what temp files have to do with RAM; maybe the S3 input plugin is trying to load all those files into RAM?

logstash config
https://gist.github.com/vanga/74bc60e88d7e1022b53d
OOM error
https://gist.github.com/vanga/b499b428d8071a72bc79

s3 unrecoverable error: unexpected token

logstash-plugins/logstash-codec-cloudtrail#1 (comment)

{:timestamp=>"2015-06-11T16:12:27.808000-0400", :message=>"A plugin had an unrecoverable error. Will restart this plugin.\n  Plugin: <LogStash::Inputs::S3 bucket=>\"our.cloudtrail.NNNNNNNN\", credentials=>[\"XXXXXXXXXXXXX\", \"YYYYYYYYYYYY\"], region_endpoint=>\"us-east-1\", sincedb_path=>\"/var/log/logstash/cloudtrail.NNNNNNN.sincedb\", type=>\"cloudtrail\">\n  Error: unexpected token at '0c22b8e6593b3eabfb00cf5f1ed73cba1a1200fdf19aeb0646b9da1d01522010 produsaphotoevent [07/Jun/2015:23:40:37 +0000] XXX.XXX.XXX.10 arn:aws:iam::NNNNNNNN:user/Prod_User 5B8075E1A6D42855 REST.GET.BUCKET - \"GET /?prefix=HeartBeat%2FHeartBeat.txt HTTP/1.1\" 200 - 605 - 59 58 \"-\" \"aws-sdk-dotnet/1.5.18.0 .NET Runtime/4.0 .NET Framework/4.0 OS/6.1.7601.65536 S3Sync\" -\n'", :level=>:error}

I've purged the bucket, and then a new message seems to crash the S3 plugin.

Is this a bug in the cloudtrail codec or the s3 input or both?

CloudTrail-optimized polling

A very common use case for S3 polling is ingest of CloudTrail logs, which have a fixed key format within a bucket:
/AWSLogs/<AccountId>/CloudTrail/<region>/<YYYY>/<MM>/<DD>/<AccountId>_CloudTrail_<region>_<ISODate>_<random>.json.gz

Given this fixed structure, ingest and incremental polling can be optimized given:

  • Objects will not be rewritten or appended to once created
  • Within a given account and region, only one sub-prefix (the current date) will be written to.

The process would look something like:

  • Walk the prefix tree to build an initial list of /AWSLogs/<AccountId>/CloudTrail/<region>/ prefixes
  • For each prefix in the list, spawn a poller thread:
    • Walk the prefix tree to the first <YYYY>/<MM>/<DD>/ sub-prefix
    • List objects within this prefix, paging through results using max_keys, next_continuation_token, and start_after until no further objects are returned
    • When no further objects are returned, remove the <DD> token from current_prefix and call list_objects_v2({prefix: parent_prefix, start_after: current_prefix})
    • If a new common prefix is returned, update current_prefix and begin listing objects
    • If no new prefix is returned, repeat for <MM> and <YYYY> tokens
    • If no new sub-prefix is discovered, store last object key as start_after and sleep for a period of time
    • Re-start polling loop
  • Periodically check to see if new /AWSLogs/<AccountId>/CloudTrail/<region> prefixes are present and spawn new poller threads as necessary
  • If a poller thread's /AWSLogs/<AccountId>/CloudTrail/<region> prefix disappears, it should terminate.

Using the above logic, the lastdb file only needs to persist a small amount of information:

  • List of /AWSLogs/<AccountId>/CloudTrail/<region>/ prefixes with:
    • current_prefix (<YYYY>/<MM>/<DD>/)
    • next_continuation_token (opaque)
    • start_after (last object key processed)

I am happy to work on this with an optimized poller class that could be selected via configuration option. Not sure if I should fork the current master branch, or the WIP threading branch?
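To make the proposal concrete, here is a rough sketch of one listing pass of a per-region poller using the v2 listing API (the names poll_prefix and process are illustrative, error handling and the date-rollover logic described above are omitted, and this is not an actual implementation):

require "aws-sdk"   # aws-sdk v2

# Illustrative only: list objects under one <YYYY>/<MM>/<DD>/ prefix,
# resuming after the last key already processed.
def poll_prefix(client, bucket, current_prefix, start_after)
  loop do
    resp = client.list_objects_v2(
      bucket:      bucket,
      prefix:      current_prefix,
      start_after: start_after,
      max_keys:    1000
    )
    break if resp.contents.empty?
    resp.contents.each do |object|
      process(object.key)          # hand the key to the download/processing stage
      start_after = object.key     # checkpoint to persist in the sincedb-style file
    end
    break unless resp.is_truncated
  end
  start_after
end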

Add Needed Metadata

Currently the S3 plugin only adds a Message entry to the general message that gets sent to Elasticsearch. While this at least gets some of the information there, it would be extremely helpful to have the following added:

bucket -> Name of the Bucket being processed
path -> Path to the file line is extracted from
prefix -> S3 prefix being processed

This is metadata that would be extremely useful to have access to. For example, when consuming EMR logging, the path would contain the cluster ID, which would allow us to grok that into its own field for searching.
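As a rough illustration of what this request amounts to inside the plugin (field names are made up and this is not the plugin's actual code), each event built from a line would be decorated with its source, roughly like:

  # Illustrative only: attach source information to an event before it is queued.
  def decorate_with_source(event, bucket, key, prefix)
    event["[@metadata][s3_bucket]"] = bucket   # name of the bucket being processed
    event["[@metadata][s3_key]"]    = key      # object key the line was extracted from
    event["[@metadata][s3_prefix]"] = prefix   # configured prefix
    event
  end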

Enhancement Request - Use aws-mixin to enable support for IAM roles

Would ❤️ X 💯 to have this plugin updated to use the existing logstash-aws-mixin and support IAM roles for authenticating to S3. The use of hard coded credentials (or environment variables) is a real throwback and not consistent with how Things Are Done(tm) on AWS these days. Proper instance role support would be a huge win!

IAM credentials not being recognised

This was initially reported in a Discuss thread here.

Config:

input {
        s3 {
                bucket              => "mybucketname-logs-cloudtrail"
                access_key_id       => "ACCESS_KEY_HERE"
                secret_access_key   => "SECRET_KEY_HERE"
                region              => "eu-west-1"
                codec               => "cloudtrail"
                type                => "cloudtrail"
                prefix              => "AWSLogs/AWS_ACCOUNT_ID_HERE/CloudTrail/eu-west-1/2015/09/27"
                temporary_directory => "/tmp/temp-cloudtrail_s3_temp"
                sincedb_path        => "/tmp/temp-cloudtrail_s3_sincedb"
                debug               => "true"
        }
}
output {
        elasticsearch {
                host => "ELASTICSEARCH_URL_HERE"
                protocol => "http"
        }
        stdout {
                codec => "rubydebug"
        }
}

There's some debug logging here as well.

I hope the discuss thread and this gives you a good idea of the situation, but let me know if my summary sucks :)

S3 input and large buckets

migrated from: https://logstash.jira.com/browse/LOGSTASH-2125

S3 input is taking a long time until the first logfile is processed:

input {
    s3 {
        credentials => ["XXXX","XXXX"]
        bucket => "my-production-bucket"
        interval => 300
   }
}
output {
   stdout {}
}

Running it with

sudo ./logstash agent -f /etc/logstash/conf.d/central.conf  --debug

shows me that the bucket is used. As soon as I start Logstash, I can see via tcpdump that there is a lot of traffic between the host and S3.
That bucket currently has 4451 .gz files just in the root folder; subfolders have even more files.
If I create another bucket and put only one of the log files in it, that logfile is downloaded and processed more or less immediately.

Unable to specify Proxy for s3 input

My environment mandates that a proxy be used for all connections to the general internet.

Using the s3 input in 1.5.0.rc2 does not work as a result, because I cannot find a way to specify the proxy.

I tried setting the java environment proxy options with:

$env:LS_JAVA_OPTS="-Dhttp.proxyHost=$address -Dhttp.proxyPort=$port"

But it didn't seem to make much of a difference; I assume because it's using the Ruby aws-sdk?

The error I get in the output is:

{:timestamp=>"2015-04-21T16:58:51.782000+1000", :message=>"A plugin had an unrecoverable error. Will restart this plugin.\n  Plugin: <LogStash::Inputs::S3 bucket=>\"oth.console.liveagentservice.Test-tbowles-15110-12112.logs\", prefix=>\"ApiLoadBalancer\", region_endpoint=>\"ap-southeast-2\", region=>\"us-east-1\", temporary_directory=>\"C:/Windows/TEMP/logstash\">\n  Error: execution expired", :level=>:error}

The operative part is "Error: execution expired", which is typical for a request being made and timing out as a result of not going through the proxy.

I'm using Logstash 1.5.0.rc2 on a Windows machine. Does anyone know if the aws-sdk reads an environment variable for proxy settings or if there is some way that I can set it? I have a valid proxy in the HTTP_PROXY environment variable (of the form http://[ADDRESS]:[PORT]) and I can easily shift it around to wherever it is required.

Without the ability to use a proxy, I can't use the s3 input, which means I can't process the logs from ELB without significant additional effort (some sort of file sync from s3 to a local folder via Powershell, then processed by Logstash maybe?). I'd prefer to just use the S3 input.
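For what it's worth, later versions of the AWS mixin used by this plugin expose a proxy_uri option; assuming a plugin version that includes it, the relevant part of the config would look roughly like this (bucket and proxy address are placeholders):

input {
  s3 {
    bucket    => "my-elb-logs"
    region    => "ap-southeast-2"
    proxy_uri => "http://proxy.example.com:3128"   # assumes the plugin version supports proxy_uri
  }
}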

can't convert Symbol into Integer Error with Logstash-input-s3 for CloudTrail Global log bucket as input

I'm trying to use LS 2.1.1 with ES 2.1.1 for AWS CloudTrail log analysis from an S3 bucket, using the logstash-input-s3 and logstash-codec-cloudtrail plugins.
I am facing a problem when trying to start my Logstash service, whereas my configuration test passed successfully.
To debug the problem I used the --debug flag and got the error below.

Command: /opt/logstash/bin/logstash -f logstash-s3.conf --debug
Part of the error:
Settings: Default filter workers: 1
Registering s3 input {:bucket=>"cloudtrailbucket", :region=>"ap-southeast-1", :level=>:info, :file=>"logstash/inputs/s3.rb", :line=>"78", :method=>"register"}
The error reported is:
can't convert Symbol into Integer
org/jruby/RubyString.java:3919:in `[]='
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-mixin-aws-2.0.2/lib/logstash/plugin_mixins/aws_config/v1.rb:40:in `aws_options_hash'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.3/lib/logstash/inputs/s3.rb:397:in `get_s3object'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.3/lib/logstash/inputs/s3.rb:80:in `register'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.1.1-java/lib/logstash/pipeline.rb:165:in `start_inputs'
org/jruby/RubyArray.java:1613:in `each'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.1.1-java/lib/logstash/pipeline.rb:164:in `start_inputs'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.1.1-java/lib/logstash/pipeline.rb:100:in `run'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.1.1-java/lib/logstash/agent.rb:165:in `execute'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.1.1-java/lib/logstash/runner.rb:90:in `run'
org/jruby/RubyProc.java:281:in `call'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.1.1-java/lib/logstash/runner.rb:95:in `run'
org/jruby/RubyProc.java:281:in `call'
/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.22/lib/stud/task.rb:24:in `initialize'

My configuration used for logstash-input-s3:
input {
s3 {
bucket => "cloudtrailbucket"
delete => false
interval => 60 # seconds
prefix => "AWSLogs//CloudTrail/ "
type => "cloudtrail"
codec => "cloudtrail"
region => "ap-southeast-1"
aws_credentials_file => "/etc/logstash/conf.d/s3_credentials.ini"
sincedb_path => "/opt/logstash_cloudtrail/sincedb"
}
}

output {
elasticsearch {
hosts => "localhost:9200"
index => "Client-Cloudtrail"
}
}
Has anyone done this before and can share the steps?

inputs/s3.rb: Keep attempting and failing to parse the same gz file if it's corrupted

Moved from elastic/logstash#2090

I'm using logstash v1.4.2. The S3 input crashes if a bucket contains a zero-byte .gz file (throwing the error "Error: not in gzip format"). Then Logstash will retry indefinitely, since the sincedb is not updated.

I've made a change in list_new to skip empty S3 objects (code attached below) to address this particular issue, but it may be a good idea to have more robust error handling.

  private
  def list_new(since=nil)

    if since.nil?
      since = Time.new(0)
    end

    objects = {}
    @s3bucket.objects.with_prefix(@prefix).each do |log|
      # original code in v1.4.2
      # if log.last_modified > since
      if log.last_modified > since && log.content_size > 0
        objects[log.key] = log.last_modified
      end
    end

    return sorted_objects = objects.keys.sort {|a,b| objects[a] <=> objects[b]}

  end # def list_new

Failing test on Logstash_Default_Plugins_21

http://build-eu-00.elastic.co/view/LS%202.1/job/Logstash_Default_Plugins_21/8/console

  1) LogStash::Inputs::S3 when working with logs when event doesn't have a `message` field deletes the temporary file
     Failure/Error: events = fetch_events(config)
     ArgumentError:
       no time information in ""
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/lib/logstash/inputs/s3.rb:413:in `read'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/lib/logstash/inputs/s3.rb:408:in `newer?'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/lib/logstash/inputs/s3.rb:114:in `list_new_files'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/lib/logstash/inputs/s3.rb:110:in `list_new_files'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/lib/logstash/inputs/s3.rb:144:in `process_files'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/spec/support/helpers.rb:5:in `fetch_events'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/spec/inputs/s3_spec.rb:247:in `(root)'
     # ./vendor/bundle/jruby/1.9/gems/rspec-wait-0.0.8/lib/rspec/wait.rb:46:in `(root)'
     # ./rakelib/test.rake:54:in `(root)'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/task.rb:240:in `execute'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/task.rb:235:in `execute'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/task.rb:179:in `invoke_with_call_chain'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/task.rb:172:in `invoke_with_call_chain'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/task.rb:165:in `invoke'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:150:in `invoke_task'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:106:in `top_level'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:106:in `top_level'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:115:in `run_with_threads'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:100:in `top_level'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:78:in `run'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:176:in `standard_exception_handling'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:75:in `run'

  2) LogStash::Inputs::S3 when working with logs when event doesn't have a `message` field should process events
     Failure/Error: events = fetch_events(config)
     ArgumentError:
       no time information in ""
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/lib/logstash/inputs/s3.rb:413:in `read'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/lib/logstash/inputs/s3.rb:408:in `newer?'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/lib/logstash/inputs/s3.rb:114:in `list_new_files'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/lib/logstash/inputs/s3.rb:110:in `list_new_files'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/lib/logstash/inputs/s3.rb:144:in `process_files'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/spec/support/helpers.rb:5:in `fetch_events'
     # ./vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.2/spec/inputs/s3_spec.rb:242:in `(root)'
     # ./vendor/bundle/jruby/1.9/gems/rspec-wait-0.0.8/lib/rspec/wait.rb:46:in `(root)'
     # ./rakelib/test.rake:54:in `(root)'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/task.rb:240:in `execute'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/task.rb:235:in `execute'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/task.rb:179:in `invoke_with_call_chain'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/task.rb:172:in `invoke_with_call_chain'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/task.rb:165:in `invoke'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:150:in `invoke_task'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:106:in `top_level'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:106:in `top_level'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:115:in `run_with_threads'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:100:in `top_level'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:78:in `run'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:176:in `standard_exception_handling'
     # ./vendor/bundle/jruby/1.9/gems/rake-10.4.2/lib/rake/application.rb:75:in `run'

Moving gzip file support back into the input

Currently our codecs only work with bytes: they take some bytes in and output some bytes. I made a breaking change allowing the codec to use an IO object, and I am reverting it; this was a bad decision on my part.

To move compressed-file support into a codec we will need to support working with IO in the codecs and keep backward compatibility. This would require a few changes in the plugins and in how we handle the data. I'll open a ticket to discuss it.

Slow processing of CloudTrail logs from S3

input {
s3 {
bucket => "BUCKETNAME"
delete => true
interval => 60
prefix => "AWSLogs/"
region => "us-west-2"
codec => "cloudtrail"
type => "cloudtrail"
}
}

output {
stdout { codec => rubydebug }
}

{:timestamp=>"2016-02-26T19:26:17.580000+0000", :message=>"S3 input: Download remote file", :remote_key=>"AWSLogs/AWSAccountID/CloudTrail/us-west-2/2016/02/26/AWSAccountID_CloudTrail_us-west-2_20160226T0800Z_g8AAxJTg73yheICE.json.gz", :local_filename=>"/tmp/logstash/AWSAccountID_CloudTrail_us-west-2_20160226T0800Z_g8AAxJTg73yheICE.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"344", :method=>"download_remote_file"}
{:timestamp=>"2016-02-26T19:26:19.262000+0000", :message=>"Pushing flush onto pipeline", :level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
{:timestamp=>"2016-02-26T19:26:24.266000+0000", :message=>"Pushing flush onto pipeline", :level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
{:timestamp=>"2016-02-26T19:26:29.267000+0000", :message=>"Pushing flush onto pipeline", :level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
{:timestamp=>"2016-02-26T19:26:34.270000+0000", :message=>"Pushing flush onto pipeline", :level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
{:timestamp=>"2016-02-26T19:26:39.271000+0000", :message=>"Pushing flush onto pipeline", :level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
{:timestamp=>"2016-02-26T19:26:44.271000+0000", :message=>"Pushing flush onto pipeline", :level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
{:timestamp=>"2016-02-26T19:26:49.272000+0000", :message=>"Pushing flush onto pipeline", :level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
{:timestamp=>"2016-02-26T19:26:54.272000+0000", :message=>"Pushing flush onto pipeline", :level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
{:timestamp=>"2016-02-26T19:26:59.272000+0000", :message=>"Pushing flush onto pipeline", :level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}

strace -p 3203
Process 3203 attached
futex(0x7fc50e1bd9d0, FUTEX_WAIT, 3218, NULL^CProcess 3203 detached
<detached ...>

strace -p 3218
futex(0x7fc50800b354, FUTEX_WAIT_BITSET_PRIVATE, 1, {5392, 782048749}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7fc50800b328, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1456515307, 329336}, NULL) = 0
gettimeofday({1456515307, 329479}, NULL) = 0
gettimeofday({1456515307, 329604}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {5392, 782924458}) = 0
futex(0x7fc50800b354, FUTEX_WAIT_BITSET_PRIVATE, 1, {5392, 982924458}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7fc50800b328, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1456515307, 530201}, NULL) = 0
gettimeofday({1456515307, 530347}, NULL) = 0
gettimeofday({1456515307, 530478}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {5392, 983801768}) = 0
futex(0x7fc50800b354, FUTEX_WAIT_BITSET_PRIVATE, 1, {5393, 183801768}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7fc50800b328, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1456515307, 731088}, NULL) = 0
gettimeofday({1456515307, 731234}, NULL) = 0
gettimeofday({1456515307, 731363}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {5393, 184683466}) = 0

batch size

Currently every "interval (60 seconds)" input only downloads 1 file and processes.
A bucket with thousands of files will take thousands of minutes (1 min per file) to ingest.
If there can be a bulk/batch option to download X files and process then can hopefully can ingest bucket with lots of files faster.
Also specifying -w # doesn't appear to improve by utilizing threads for multiple items either.
So only way to improve this would be to have multiples of the same input which could potentially cause conflicts? (maybe specifying same sincedb might allow them to work together nicely?)

[Feature Request] Add the ability to work with a S3 Compatible storage

In its current state, the plugin can only work with AWS. I've tried to add the capability to work with any S3 compatible storage (which supports the AWS authentication) by adding the endpoint into the S3 object creation.

However, it seems that the current SDK version does not support endpoint in the initialization of the S3 object (V2 of the SDK supports it via the AWS::Client object).

Edit:
For some reason I'm able to use the Aws::S3::Resource object (although, according to the docs, it should only be available in v2 of the SDK). Using it, I can do the following:

client = Aws::S3::Client.new(
        ...
        :endpoint => @s3_endpoint
        )
s3 = Aws::S3::Resource.new(client: client)

Working with the s3 object, I can access the S3-compatible bucket; however, the entire plugin (obviously) breaks, as it expects an s3 object with .buckets (and not a single resource). Sadly, I'm not proficient enough in Ruby to go over and fix all the flows.

S3 v4 signatures for logstash input plugin

Hi ,

I was trying to collect all my AWS Config logs from S3 via Logstash. I could see it list the filenames fine, but once the listing is complete it starts to download the S3 files, and that fails with the error "Error: Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4.". Once it gets this error it restarts the plugin, starts listing the files again, and then errors again.

Here is the complete log (file paths replaced):

S3 input: Download remote file {:remote_key=>"logs/111/file_brqvFXbF63.json.gz", :local_filename=>"/tmp/logstash/file_brqvFXbF63.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"344", :method=>"download_remote_file"}
A plugin had an unrecoverable error. Will restart this plugin.
Plugin: "s3bucket", codec=>"UTF-8">, interval=>600, prefix=>"logs/111/", region=>"eu-west-1", use_ssl=>true, delete=>false, temporary_directory=>"/tmp/logstash">
Error: Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4.
Exception: AWS::S3::Errors::InvalidArgument
Stack: /opt/logstash/vendor/bundle/jruby/1.9/gems/aws-sdk-v1-1.66.0/lib/aws/core/client.rb:375:in `return_or_raise'
/opt/logstash/vendor/bundle/jruby/1.9/gems/aws-sdk-v1-1.66.0/lib/aws/core/client.rb:476:in `client_request'
(eval):3:in `get_object'
/opt/logstash/vendor/bundle/jruby/1.9/gems/aws-sdk-v1-1.66.0/lib/aws/s3/s3_object.rb:1371:in `get_object'
/opt/logstash/vendor/bundle/jruby/1.9/gems/aws-sdk-v1-1.66.0/lib/aws/s3/s3_object.rb:1090:in `read'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.6/lib/logstash/inputs/s3.rb:346:in `download_remote_file'
org/jruby/RubyIO.java:1183:in `open'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.6/lib/logstash/inputs/s3.rb:345:in `download_remote_file'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.6/lib/logstash/inputs/s3.rb:321:in `process_log'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.6/lib/logstash/inputs/s3.rb:151:in `process_files'
org/jruby/RubyArray.java:1613:in `each'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.6/lib/logstash/inputs/s3.rb:146:in `process_files'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.6/lib/logstash/inputs/s3.rb:102:in `run'
org/jruby/RubyProc.java:281:in `call'
/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.22/lib/stud/interval.rb:20:in `interval'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-s3-2.0.6/lib/logstash/inputs/s3.rb:101:in `run'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.2.3-java/lib/logstash/pipeline.rb:331:in `inputworker'
/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.2.3-java/lib/logstash/pipeline.rb:325:in `start_input' {:level=>:error, :file=>"logstash/pipeline.rb", :line=>"342", :method=>"inputworker"}
S3 input: Found key {:key=>"logs/111/fil2_akbtQDyyS6nHW.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"111", :method=>"list_new_files"}

I can see that "Support S3 v4 signatures" was added to the logstash-output-s3 plugin some time this year. Does the input plugin have support for AWS S3 v4 signatures?

Regards,
Abey

Unable to use aws_credentials_file as in S3 output

I'm trying to use this S3 input plugin to read files previously written by the S3 output plugin.

The following output config works just fine:

input { 
    file { 
        path => [ '/var/log/syslog' ]
    }
}

output {
    s3 {
        aws_credentials_file => '/srv/logstash/config/input.conf'
        bucket => 'mahbucket'
        use_ssl => true
    }
}

However, the same configuration for using S3 input fails:

input {
    s3 {
        aws_credentials_file => '/srv/logstash/config/input.conf'
        bucket => 'mahbucket'
        use_ssl => true
    }
}

output {
    stdout { }
}

I get the following errors:

A plugin had an unrecoverable error. Will restart this plugin.
  Plugin: <LogStash::Inputs::S3 aws_credentials_file=>"/srv/logstash/config/aws-credentials.yml", bucket=>"mahbucket", region=>"us-east-1", temporary_directory=>"/tmp/logstash">
  Error: 
Missing Credentials.

Unable to find AWS credentials.  You can configure your AWS credentials
a few different ways:

* Call AWS.config with :access_key_id and :secret_access_key

* Export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to ENV

* On EC2 you can run instances with an IAM instance profile and credentials
  will be auto loaded from the instance metadata service on those
  instances.

* Call AWS.config with :credential_provider.  A credential provider should
  either include AWS::Core::CredentialProviders::Provider or respond to
  the same public methods.

= Ruby on Rails

In a Ruby on Rails application you may also specify your credentials in
the following ways:

* Via a config initializer script using any of the methods mentioned above
  (e.g. RAILS_ROOT/config/initializers/aws-sdk.rb).

* Via a yaml configuration file located at RAILS_ROOT/config/aws.yml.
  This file should be formated like the default RAILS_ROOT/config/database.yml
  file.

 {:level=>:error}

I'm using the same AWS credentials file for input and for output and it works for output but not for input, hence I'm filing this bug.

I'm using Logstash 1.5.0.rc2 and Logstash S3 Input 0.1.8.
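For reference, the aws_credentials_file is expected to be a small YAML file; as I understand the AWS mixin, it is shaped roughly like the following (values are placeholders):

# contents of the file passed to aws_credentials_file
:access_key_id: "AKIAEXAMPLEKEY"
:secret_access_key: "example-secret-key"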

Multiple Files Being Left Unprocessed with Identical Timestamps

I've had a long-standing issue with LS 1.5.x leaving files pending in my S3 input bucket and never being read. In tracing this down, this test (https://github.com/logstash-plugins/logstash-input-s3/blob/master/lib/logstash/inputs/s3.rb#L408) looks suspect to me. I am processing S3 files on a batch basis and have multiple files present on startup with identical last_modified timestamps (second-level precision).

It looks like the newer function, using a greater than, will process the first file it finds with the earliest timestamp, but skip over any subsequent ones.

For example:

Timestamp File
2015-01-01 08:00:00Z file1.gz
2015-01-01 08:00:00Z file2.gz
2015-01-01 08:00:00Z file3.gz
2015-01-01 08:00:01Z file4.gz
2015-01-01 08:00:02Z file5.gz
2015-01-01 08:00:03Z file6.gz

If I'm understanding the code, file1.gz will get parsed, but file2.gz and file3.gz will fail the > test and will be skipped, leaving them hanging around after the conclusion of the batch run.

Is my interpretation correct and could this be changed safely to >=? I'm using delete after read so the concern of re-reading the same file isn't a problem for me.
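For clarity, the proposed change boils down to the comparison used by newer? (a sketch based on my reading of the linked line, not the actual plugin code):

  # Sketch only, not the plugin's actual code.
  def newer?(last_modified, sincedb_time)
    # current behaviour: strictly newer, so objects sharing the sincedb
    # timestamp are skipped
    last_modified > sincedb_time
  end

  def newer_or_equal?(last_modified, sincedb_time)
    # proposed: also accept identical timestamps; safe with delete-after-read,
    # otherwise the last processed file may be re-read once
    last_modified >= sincedb_time
  end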

Error: no time information in \"\"

Suddenly, I'm getting the following error in a loop:

{:timestamp=>"2015-11-16T10:52:24.320000+0000", :message=>"A plugin had an unrecoverable error. Will restart this plugin.\n  Plugin: <LogStash::Inputs::S3 bucket=>\"fs-csm-dev\", type=>\"elb\", prefix=>\"access-logs/\", region=>\"eu-west-1\", delete=>true, interval=>60, debug=>false, codec=><LogStash::Codecs::Plain charset=>\"UTF-8\">, use_ssl=>true, temporary_directory=>\"/var/lib/logstash/logstash\">\n  Error: no time information in \"\"", :level=>:error}

I am not out of disk space. The directory is:

$ ls -la /var/lib/logstash/logstash/
total 8
drwxr-xr-x 2 logstash logstash 4096 Nov 16 10:55 .
drwxrwxr-x 4 logstash logstash 4096 Oct  6 07:24 ..

Running logstash 1.5.5.

Reject invalid bucket name

Currently, you can use any characters when defining the bucket name for the s3 input.
Adding some sort of validation on the string would help the user debug configuration errors and prevent the user from accidentally defining the file prefix in the bucket name.
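A sketch of what such a validation could look like at register time (the regex only approximates the S3 bucket naming rules, and this is not the plugin's actual code):

  # Illustrative only: reject obviously invalid bucket names before any API call.
  BUCKET_NAME_PATTERN = /\A[a-z0-9][a-z0-9.\-]{1,61}[a-z0-9]\z/

  def validate_bucket_name!(name)
    unless name =~ BUCKET_NAME_PATTERN
      raise LogStash::ConfigurationError, "Invalid S3 bucket name: #{name.inspect}"
    end
  end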

Logstash S3 input doesn't recognise AWS region us-east-2

Logstash v2.2.4

The error:
Invalid setting for s3 input plugin:

input {
s3 {

This setting must be a ["us-east-1", "us-west-1", "us-west-2", "eu-central-1", "eu-west-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "sa-east-1", "us-gov-west-1", "cn-north-1"]

Expected one of ["us-east-1", "us-west-1", "us-west-2", "eu-central-1", "eu-west-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "sa-east-1", "us-gov-west-1", "cn-north-1"], got ["us-east-2"]

region => "us-east-2"
...
}
} {:level=>:error}
Error: Something is wrong with your configuration. {:level=>:error}

My configuration:
s3 {
bucket => "mybucketname"
access_key_id => "key"
secret_access_key => "secret"
region => "us-east-2"
prefix => "AWSLogs/89318..."
type => "sometype"
}

Windows test failure

From elastic/logstash#2487

  27) LogStash::Inputs::S3#list_new_files should accepts a list of credentials for the aws-sdk, this is deprecated
     Failure/Error: expect{ config.register }.not_to raise_error
       expected no Exception, got #<Errno::ESRCH: No such process - C:\Users\jls\Documents\GitHub\logstash\tmp\mybackup> with backtrace:
         # C:\Users\jls\Documents\GitHub\logstash\lib\logstash\runner.rb:57:in `run'
         # C:\Users\jls\Documents\GitHub\logstash\lib\logstash\runner.rb:112:in `run'
         # C:\Users\jls\Documents\GitHub\logstash\lib\logstash\runner.rb:170:in `run'
     # C:\Users\jls\Documents\GitHub\logstash\lib\logstash\runner.rb:57:in `run'
     # C:\Users\jls\Documents\GitHub\logstash\lib\logstash\runner.rb:112:in `run'
     # C:\Users\jls\Documents\GitHub\logstash\lib\logstash\runner.rb:170:in `run'

SinceDB should support file offset and not only the keys

The current sincedb implementation of this plugin relies only on the object key and doesn't use the file offset at all, so when we stop Logstash in the middle of reading a file we have no choice but to read the file again from the beginning, causing duplicates in the log stream.

We should investigate whether we could use the filewatch library to actually do the file reading, with the S3 input acting as a downloading agent.

Setting up a prefix in Logstash makes it treat the prefix like an actual file

If you set up a prefix in the config to target specific logs, the aws-sdk will return the prefixed directory as if it were an actual object, and the plugin will crash because it tries to treat it as a log file.

We don't have a test for this particular case; the bug was introduced when we added the option to back up to another S3 prefix.
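A likely direction for a fix is to skip keys that only represent the prefix or a "directory" placeholder while listing, sketched here against the v1-style listing code quoted elsewhere in this document (not the actual patch):

  # Illustrative only: ignore zero-byte "folder" keys and the prefix itself.
  objects = {}
  @s3bucket.objects.with_prefix(@prefix).each do |log|
    next if log.key.end_with?("/")   # S3 "folders" are keys ending in a slash
    next if log.key == @prefix       # the prefix can come back as its own key
    objects[log.key] = log.last_modified
  end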

Error: undefined method `start_with?

Running Logstash 1.5.0.RC2 and getting the following error message:

A plugin had an unrecoverable error. Will restart this plugin.
Error: undefined method `start_with?' for nil:NilClass {:level=>:error}

Here is the relevant part of my current config.

input {
    s3 {
        bucket => "my-s3-bucket"
        access_key_id => "XXX"
        secret_access_key => "XXX"
        sincedb_path => "/data/logstash/.sincedb-logstash"
        prefix => "path/to/logs"
        type => "cloudtrail"
        codec => "json"
    }
}

output {
    elasticsearch { cluster => "elasaticsearch" }
}

I tested with the stdout output and things seemed to work okay. I tested this config with no codec defined and it pulled logs in but it only created one entry in ES.

Not sure if that means anything or not.

Multiple Broker/Indexers ingesting

I'd like to ask how you would run multiple Logstash servers (for HA) pulling from the same S3 input when they don't share a sincedb_path.
Could you use an NFS/GFS filesystem and have more than one instance of Logstash using the same sincedb file?
This might not even be possible, but it would be really helpful for when S3 input threads die (with Logstash still running) and ingestion has stopped.
Obviously fixing the S3 input threads so they don't die is the correct fix, but for HA, if the LS node died, it would be nice to have two running so the node could be fixed without downstream data loss/backup/delay.

s3 input with cloudtrail codec not working with gzipped files

Not sure if this is a problem with the s3 input plugin or the cloudtrail codec, but I can't seem to get the s3 input with the cloudtrail codec working if the file is gzipped (which is the default for CloudTrail). It does work if I download the file, unzip it, and upload it back into a different S3 bucket.

logstash 2.2.2
logstash-input-s3 2.0.4
logstash-codec-cloudtrail 2.0.2

I started out with a normal cloudtrail bucket created by AWS, and a simple config like this:

input {   
    s3 {
        bucket => "cloudtrail-logs" 
        codec => cloudtrail {}
    }
}

output {
    stdout { codec => rubydebug }
}

When I run logstash with --debug, I see this:

S3 input: Adding to objects[] {:key=>"AWSLogs/blahblah/CloudTrail/us-east-1/2016/03/15/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"116", :method=>"list_new_files"}
S3 input processing {:bucket=>"cloud-analytics-platform-cloudtrail-logs", :key=>"AWSLogs/blahblah/CloudTrail/us-east-1/2016/03/15/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"150", :method=>"process_files"}
S3 input: Download remote file {:remote_key=>"AWSLogs/blahblah/CloudTrail/us-east-1/2016/03/15/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :local_filename=>"/var/folders/8f/1bjm5vq53c73tjq0yl4560dj1r5f6h/T/logstash/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"344", :method=>"download_remote_file"}
Processing file {:filename=>"/var/folders/8f/1bjm5vq53c73tjq0yl4560dj1r5f6h/T/logstash/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"182", :method=>"process_local_log"}
Pushing flush onto pipeline {:level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
Pushing flush onto pipeline {:level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
Pushing flush onto pipeline {:level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}

And it just keeps printing that last line over and over and never does anything else. If I go look in /var/folders/8f/1bjm5vq53c73tjq0yl4560dj1r5f6h/T/logstash/ I do indeed see a gzipped file, blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz .

Now, if I unzip this file, and create myself a test bucket, and put the unzipped file into the test bucket, and run logstash pointing at my test bucket, it works fine!

According to the docs at https://www.elastic.co/guide/en/logstash/current/plugins-inputs-s3.html , if the filename ends in .gz then the s3 input should handle it automatically.

Unit test broken

See: http://build-eu-00.elastic.co/view/LS%20Plugins/view/LS%20Inputs/job/logstash-plugin-input-s3-unit/jdk=JDK7,nodes=metal-pool/18/console

Using rspec 3.1.0
Using rspec-wait 0.0.8
Using logstash-devutils 0.0.18
Using logstash-mixin-aws 2.0.2
Using logstash-input-s3 1.0.0 from source at .
Using bundler 1.10.6
Bundle complete! 5 Gemfile dependencies, 57 gems now installed.
Use `bundle show [gemname]` to see where a bundled gem is installed.
RuntimeError: Logstash expects concurrent-ruby version 0.9.1 and version 0.9.2 is installed, please verify this patch: /mnt/jenkins/rbenv/versions/jruby-1.7.20/lib/ruby/gems/shared/gems/logstash-core-2.0.0.snapshot5-java/lib/logstash/patches/silence_concurrent_ruby_warning.rb
           (root) at /mnt/jenkins/rbenv/versions/jruby-1.7.20/lib/ruby/gems/shared/gems/logstash-core-2.0.0.snapshot5-java/lib/logstash/patches/silence_concurrent_ruby_warning.rb:53
          require at org/jruby/RubyKernel.java:1072
           (root) at /mnt/jenkins/rbenv/versions/jruby-1.7.20/lib/ruby/gems/shared/gems/logstash-core-2.0.0.snapshot5-java/lib/logstash/patches.rb:1
          require at org/jruby/

S3 input error The AWS Access Key Id you provided does not exist in our records

(This issue was originally filed by @karasmeitar at elastic/logstash#3193)


I'm trying to get some log files from an S3 bucket and put them into Elasticsearch.
My config file is:
input {
s3 {
bucket => "dist-platform-qa"
prefix => "es_export_data"
credentials =>"/home/dev/logstash-1.4.2/Aws.config"
region_endpoint => "us-east-1"
}
}
output {
elasticsearch {
host => "localhost"
protocol => "http"
port=> "9200"
index=> "all"
}
}

My Aws.config file:

AWS_ACCESS_KEY_ID = "blabla"
AWS_SECRET_ACCESS_KEY = "blabla"

But I'm still getting errors for my AWS access key ("The AWS Access Key Id you provided does not exist in our records").
When I check the permissions with s3cmd I can get files from the bucket and everything is OK.
Any idea?

Tests fail for no reason

bundle exec rspec returns 13 failures on master branch and also on old tags. Am I missing something when I run it like that?

Example output (another 11 tests fail too):

Failures:

  1) LogStash::Inputs::S3#list_new_files should sort return object sorted by last_modification date with older first
     Failure/Error: expect(config.list_new_files).to eq(['TWO_DAYS_AGO', 'YESTERDAY', 'TODAY'])

       expected: ["TWO_DAYS_AGO", "YESTERDAY", "TODAY"]
            got: ["TODAY"]

       (compared using ==)
     # ./spec/inputs/s3_spec.rb:124:in `(root)'

  2) LogStash::Inputs::S3#list_new_files should support not providing a exclude pattern
     Failure/Error: expect(config.list_new_files).to eq(objects_list.map(&:key))

       expected: ["exclude-this-file-1", "exclude/logstash", "this-should-be-present"]
            got: ["this-should-be-present"]

       (compared using ==)
     # ./spec/inputs/s3_spec.rb:69:in `(root)'

plugin fails to recognize new AWS region ap-south-1

tested with
logstash-2.3.4
logstash-2.2.2

OS: ArchLinux

Reading config file {:config_file=>"/usr/local/conf-logstash/elb.conf", :level=>:debug, :file=>"logstash/config/loader.rb", :line=>"69", :method=>"local_config"}
Plugin not defined in namespace, checking for plugin file {:type=>"input", :name=>"s3", :path=>"logstash/inputs/s3", :level=>:debug, :file=>"logstash/plugin.rb", :line=>"76", :method=>"lookup"}
Invalid setting for s3 input plugin:

  input {
    s3 {
      # This setting must be a ["us-east-1", "us-west-1", "us-west-2", "eu-central-1", "eu-west-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "sa-east-1", "us-gov-west-1", "cn-north-1"]
      # Expected one of ["us-east-1", "us-west-1", "us-west-2", "eu-central-1", "eu-west-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "sa-east-1", "us-gov-west-1", "cn-north-1"], got ["ap-south-1"]
      region => "ap-south-1"
      ...
    }
  } {:level=>:error, :file=>"logstash/config/mixin.rb", :line=>"374", :method=>"validate_check_parameter_values"}
Plugin not defined in namespace, checking for plugin file {:type=>"codec", :name=>"plain", :path=>"logstash/codecs/plain", :level=>:debug, :file=>"logstash/plugin.rb", :line=>"76", :method=>"lookup"}
config LogStash::Codecs::Plain/@charset = "UTF-8" {:level=>:debug, :file=>"logstash/config/mixin.rb", :line=>"153", :method=>"config_init"}
The given configuration is invalid. Reason: Something is wrong with your configuration. {:level=>:fatal, :file=>"logstash/agent.rb", :line=>"189", :method=>"execute"}

Looks like the configtest itself is failing. How can I include this region? Do I have to update something?

Thanks.

S3 Input - "use_ssl" should be marked as deprecated.

  • Logstash 5.0
  • Linux
  • Input config:

input {
s3 {
bucket => "XXXXXX"
interval => 20 # seconds
prefix => "data"
backup_add_prefix => "old/logstash-"
backup_to_bucket => "XXXXX"
delete => true
region => "us-east-1"
type => "cloudfront"
codec => "plain"
secret_access_key => "XXXXXXXX"
access_key_id => "XXXXXX"
use_ssl => true
}
}

I'm seeing this error:
[ERROR][logstash.inputs.s3] Unknown setting 'use_ssl' for s3

I think SSL is the only option now, so it's not a software issue, but I suggest the documentation for "Logstash 5.0" mark this property as deprecated.

Cheers!

S3 directory handling

From LOGSTASH-1548

Can the s3 input plugin be upgraded to search specific folders?
For example, in the file input you can define a glob to search with, like /var/log/apache2/access*.
How would it be possible to do the same thing for the s3 input?
It would also be handy to define the same thing in the s3 output, so you could put the files into a specific folder structure you define.

@ph Isn't that what the prefix config is doing?
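For completeness, prefix narrows the listing to keys under one "folder", and exclude_pattern filters keys out of that listing; a config sketch (bucket and paths are placeholders):

input {
  s3 {
    bucket          => "my-logs"
    prefix          => "apache2/access/"   # only keys under this "folder" are listed
    exclude_pattern => "\.tmp$"            # keys matching this Ruby regex are skipped
  }
}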
