cloudstax / firecamp

Serverless Platform for the stateful services

License: Apache License 2.0

Go 85.91% Shell 7.91% Makefile 0.05% Ruby 3.87% Dockerfile 2.26%

aws cassandra cloud consul container couchdb docker ecs elasticsearch kafka mongodb postgresql redis serverless statefulservices swarm zookeeper

firecamp's Issues

Create CloudWatch alarms to watch the available free space

Firecamp upgrade guide

Hi @JuniusLuo. Could you please publish a Firecamp upgrade guide to the wiki? I'm running 0.9.6 and would like to upgrade to 1.0, but I'm worried about the correct procedure.

Typo in Installation wiki page

I don't know how to make a pull request for wiki pages. There's a small typo in the "Delete the Stateful Service" section: it uses the create-service command instead of delete-service. Please fix.

New Kafka version

Hello. Can you please update the Kafka service to use the latest Kafka version - 2.0.0?

Wasn't able to build docker container for postgres

...
Reading package lists...
Building dependency tree...
Reading state information...
E: Version '9.6.6-1.pgdg80+1' for 'postgresql-9.6' was not found
E: Version '9.6.6-1.pgdg80+1' for 'postgresql-contrib-9.6' was not found
The command '/bin/sh -c apt-get update  && apt-get install -y postgresql-common         && sed -ri 's/#(create_main_cluster) .*$/\1 = false/' /etc/postgresql-common/createcluster.conf     && apt-get install -y           dnsutils                postgresql-$PG_MAJOR=$PG_VERSION                postgresql-contrib-$PG_MAJOR=$PG_VERSION    && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100
make: *** [docker] Error 100

Cassandra task placement constraint issue

In the test cluster I'm running, there's only one node in the ASG. It was in AZ us-east-1a. Before going home yesterday, I scaled the ASG down to 0 nodes. Today I set the ASG back to 1 node, but this time AWS started the EC2 instance in AZ us-east-1b.
ECS started firecamp-manageserver without issues, but it wasn't able to start C* because of a task placement constraint that appears to be tied to AZ us-east-1a.

I've checked Task definition for C* - it doesn't have any placement constraints set.

Do you know what's going on and how to fix it? I tried updating the C* service with the "Force deployment" checkbox set, but it didn't help.

Add kafka-manager

Please integrate kafka-manager (https://github.com/yahoo/kafka-manager) into Kafka service creation.
A ready-to-use docker image, https://hub.docker.com/r/sheepkiller/kafka-manager/, can be used. Probably the best way is to add command-line flags such as -enable-kafka-manager and -kafka-manager-port=9000 to run it along with the Kafka service.

wasn't able to build firecamp-cassandra-init docker container

+ target=firecamp-cassandra-init
+ image=mydocker/firecamp-cassandra-init:3.11
+ path=/home/user/go/src/github.com/cloudstax/firecamp/catalog/cassandra/3.11/init-task-dockerfile/
+ cp /home/user/go/src/github.com/cloudstax/firecamp/catalog/waitdns.sh /home/user/go/src/github.com/cloudstax/firecamp/catalog/cassandra/3.11/init-task-dockerfile/
+ docker build -q -t mydocker/firecamp-cassandra-init:3.11 /home/user/go/src/github.com/cloudstax/firecamp/catalog/cassandra/3.11/init-task-dockerfile/
Sending build context to Docker daemon  6.656kB
Step 1/11 : FROM debian:jessie-backports
 ---> f48e88a3ad1f
Step 2/11 : RUN {       echo 'Package: openjdk-* ca-certificates-java';       echo 'Pin: release n=*-backports';       echo 'Pin-Priority: 990';     } > /etc/apt/preferences.d/java-backports
 ---> Running in 3d6f40fb4105
Removing intermediate container 3d6f40fb4105
 ---> 1b6e064f6aa5
Step 3/11 : ENV GPG_KEYS        514A2AD631A57A16DD0047EC749D6EEC0353B12C        A26E528B271F19B9E5D8E19EA278B781FE4B2BDA
 ---> Running in 1853bc38b07a
Removing intermediate container 1853bc38b07a
 ---> 4c2f20d37ef8
Step 4/11 : RUN set -ex;        export GNUPGHOME="$(mktemp -d)";        for key in $GPG_KEYS; do                gpg --keyserver ha.pool.sks-keyservers.net --recv-keys "$key";      done;   gpg --export $GPG_KEYS > /etc/apt/trusted.gpg.d/cassandra.gpg;  rm -r "$GNUPGHOME";     apt-key list
 ---> Running in 569cb039131a
+ mktemp -d
+ export GNUPGHOME=/tmp/tmp.cufASzNBoI
+ gpg --keyserver ha.pool.sks-keyservers.net --recv-keys 514A2AD631A57A16DD0047EC749D6EEC0353B12C
gpg: keyring `/tmp/tmp.cufASzNBoI/secring.gpg' created
gpg: keyring `/tmp/tmp.cufASzNBoI/pubring.gpg' created
gpg: requesting key 0353B12C from hkp server ha.pool.sks-keyservers.net
gpg: /tmp/tmp.cufASzNBoI/trustdb.gpg: trustdb created
gpg: key 0353B12C: public key "T Jake Luciani <[email protected]>" imported
gpg: no ultimately trusted keys found
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
+ gpg --keyserver ha.pool.sks-keyservers.net --recv-keys A26E528B271F19B9E5D8E19EA278B781FE4B2BDA
gpg: requesting key FE4B2BDA from hkp server ha.pool.sks-keyservers.net
gpg: key FE4B2BDA: public key "Michael Shuler <[email protected]>" imported
gpg: no ultimately trusted keys found
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
+ gpg --export 514A2AD631A57A16DD0047EC749D6EEC0353B12C A26E528B271F19B9E5D8E19EA278B781FE4B2BDA
+ rm -r /tmp/tmp.cufASzNBoI
+ apt-key list
/etc/apt/trusted.gpg.d/cassandra.gpg
------------------------------------
pub   4096R/0353B12C 2014-09-05
uid                  T Jake Luciani <[email protected]>
sub   4096R/D35F8215 2014-09-05

pub   4096R/FE4B2BDA 2009-07-15
uid                  Michael Shuler <[email protected]>
uid                  Michael Shuler <[email protected]>
sub   4096R/25A883ED 2009-07-15

/etc/apt/trusted.gpg.d/debian-archive-jessie-automatic.gpg
----------------------------------------------------------
pub   4096R/2B90D010 2014-11-21 [expires: 2022-11-19]
uid                  Debian Archive Automatic Signing Key (8/jessie) <[email protected]>

/etc/apt/trusted.gpg.d/debian-archive-jessie-security-automatic.gpg
-------------------------------------------------------------------
pub   4096R/C857C906 2014-11-21 [expires: 2022-11-19]
uid                  Debian Security Archive Automatic Signing Key (8/jessie) <[email protected]>

/etc/apt/trusted.gpg.d/debian-archive-jessie-stable.gpg
-------------------------------------------------------
pub   4096R/518E17E1 2013-08-17 [expires: 2021-08-15]
uid                  Jessie Stable Release Key <[email protected]>

/etc/apt/trusted.gpg.d/debian-archive-stretch-automatic.gpg
-----------------------------------------------------------
pub   4096R/F66AEC98 2017-05-22 [expires: 2025-05-20]
uid                  Debian Archive Automatic Signing Key (9/stretch) <[email protected]>
sub   4096R/B7D453EC 2017-05-22 [expires: 2025-05-20]

/etc/apt/trusted.gpg.d/debian-archive-stretch-security-automatic.gpg
--------------------------------------------------------------------
pub   4096R/8AE22BA9 2017-05-22 [expires: 2025-05-20]
uid                  Debian Security Archive Automatic Signing Key (9/stretch) <[email protected]>
sub   4096R/331F7F50 2017-05-22 [expires: 2025-05-20]

/etc/apt/trusted.gpg.d/debian-archive-stretch-stable.gpg
--------------------------------------------------------
pub   4096R/1A7B6500 2017-05-20 [expires: 2025-05-18]
uid                  Debian Stable Release Key (9/stretch) <[email protected]>

/etc/apt/trusted.gpg.d/debian-archive-wheezy-automatic.gpg
----------------------------------------------------------
pub   4096R/46925553 2012-04-27 [expires: 2020-04-25]
uid                  Debian Archive Automatic Signing Key (7.0/wheezy) <[email protected]>

/etc/apt/trusted.gpg.d/debian-archive-wheezy-stable.gpg
-------------------------------------------------------
pub   4096R/65FFB764 2012-05-08 [expires: 2019-05-07]
uid                  Wheezy Stable Release Key <[email protected]>

Removing intermediate container 569cb039131a
 ---> 83e8962b3a75
Step 5/11 : RUN echo 'deb http://www.apache.org/dist/cassandra/debian 311x main' >> /etc/apt/sources.list.d/cassandra.list
 ---> Running in cac39e19f098
Removing intermediate container cac39e19f098
 ---> 9fa7a6eac994
Step 6/11 : ENV CASSANDRA_VERSION 3.11.0
 ---> Running in 353390195efa
Removing intermediate container 353390195efa
 ---> 7494779b0b6a
Step 7/11 : RUN apt-get update  && apt-get install -y     curl     dnsutils     cassandra="$CASSANDRA_VERSION"     cassandra-tools="$CASSANDRA_VERSION"    && rm -rf /var/lib/apt/lists/*
 ---> Running in b9d7183129d4
Get:1 http://security.debian.org jessie/updates InRelease [63.1 kB]
Ign http://deb.debian.org jessie InRelease
Get:2 http://security.debian.org jessie/updates/main amd64 Packages [608 kB]
Get:3 http://deb.debian.org jessie-updates InRelease [145 kB]
Get:4 http://www.apache.org 311x InRelease [3169 B]
Get:5 http://deb.debian.org jessie-backports InRelease [166 kB]
Get:6 http://deb.debian.org jessie Release.gpg [2434 B]
Get:7 http://deb.debian.org jessie Release [148 kB]
Get:8 http://www.apache.org 311x/main amd64 Packages [686 B]
Get:9 http://deb.debian.org jessie-updates/main amd64 Packages [23.1 kB]
Get:10 http://deb.debian.org jessie-backports/main amd64 Packages [1172 kB]
Get:11 http://deb.debian.org jessie/main amd64 Packages [9064 kB]
Fetched 11.4 MB in 7s (1462 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Version '3.11.0' for 'cassandra' was not found
E: Version '3.11.0' for 'cassandra-tools' was not found
The command '/bin/sh -c apt-get update  && apt-get install -y     curl     dnsutils     cassandra="$CASSANDRA_VERSION"     cassandra-tools="$CASSANDRA_VERSION"     && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100
make: *** [docker] Error 100

No issues with firecamp-cassandra.

Cassandra: incremental backup

$ cat cassandra.yaml
...
incremental_backups: false

Is it false intentionally?

Kafka logs flooding

Hi. I found a lot of messages like this in the logs (CloudWatch):

[2017-11-29 22:43:58,934] INFO [GroupMetadataManager brokerId=2] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)

Is there an option to stop this flooding?

Implement update-service operation in firecamp-service-cli

It would be great to have an update-service operation in the CLI tool to manage the number of replicas, volume size, heap size, etc.

init.sh script cannot be loaded from S3

Nodes in the ECSClusterStack-ServiceAutoScalingGroup don't seem to be able to fetch the init.sh file from S3. Here's the content of my cloud-init-output.log:

Loaded plugins: priorities, update-motd, upgrade-helper
Package aws-cfn-bootstrap-1.4-26.17.amzn1.noarch already installed and latest version
Nothing to do
+ version=0.9
+ aws s3 cp s3://cloudstax/firecamp/releases/0.9/scripts/init.sh /tmp/init.sh
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
Nov 21 03:36:52 cloud-init[2813]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
Nov 21 03:36:52 cloud-init[2813]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Nov 21 03:36:52 cloud-init[2813]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Cloud-init v. 0.7.6 finished at Tue, 21 Nov 2017 03:36:52 +0000. Datasource DataSourceEc2.  Up 53.08 seconds

When I tried running the S3 command manually, I got the same exception:

[root@ip-10-0-35-248 tmp]# aws s3 cp s3://cloudstax/firecamp/releases/0.9/scripts/init.sh /tmp/init.sh
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

Kafka Zookeeper logs show connection refused

I used the CloudFormation template firecamp-existingvpc to roll out the environment, then created a ZooKeeper service (for Kafka) according to the instructions. Here is what appears in the CloudWatch (firecamp-qa-zoo-qa) logs:

2017-12-01 08:20:31,031 [myid:3] - INFO [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumPeer$QuorumServer@167] - Resolved hostname: zoo-qa-1.firecamp-qa-firecamp.com to address: zoo-qa-1.firecamp-qa-firecamp.com/172.22.2.62
2017-12-01 08:21:31,029 [myid:3] - INFO [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumCnxManager$Listener@746] - Received connection request /172.22.2.62:59352
2017-12-01 08:21:31,030 [myid:3] - WARN [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumCnxManager@588] - Cannot open channel to 2 at election address zoo-qa-1.firecamp-qa-firecamp.com/172.22.2.62:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.handleConnection(QuorumCnxManager.java:479)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:379)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:757)
2017-12-01 08:21:31,033 [myid:3] - INFO [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumPeer$QuorumServer@167] - Resolved hostname: zoo-qa-1.firecamp-qa-firecamp.com to address: zoo-qa-1.firecamp-qa-firecamp.com/172.22.2.62
2017-12-01 08:22:31,031 [myid:3] - INFO [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumCnxManager$Listener@746] - Received connection request /172.22.2.62:59356
2017-12-01 08:22:31,032 [myid:3] - WARN [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumCnxManager@588] - Cannot open channel to 2 at election address zoo-qa-1.firecamp-qa-firecamp.com/172.22.2.62:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.handleConnection(QuorumCnxManager.java:479)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:379)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:757)

Is that OK?

I see that 172.22.2.62 has an established connection with the third ZooKeeper instance (172.22.1.119):

[root@ip-172-22-2-62 ec2-user]# netstat -anp|grep 3888
tcp        0      0 ::ffff:172.22.2.62:59908    ::ffff:172.22.5.201:3888    TIME_WAIT   -
tcp        0      0 ::ffff:172.22.2.62:52050    ::ffff:172.22.1.119:3888    ESTABLISHED 3637/java

How to change the DNS zone name?

Cassandra: OpenJDK vs Oracle Java

There is a warning in the log:

WARN [main] 2018-02-13 10:16:44,908 StartupChecks.java:203 - OpenJDK is not recommended. Please upgrade to the newest Oracle Java release

Do you think we need to switch to Oracle Java?

no service type check

I accidentally forgot to change the service name and it still worked:

$ $PREFIX/firecamp-service-cli -op=stop-service -service-type=kafka -region=us-east-1 -cluster=firecamp-$ENV -service-name=zoo-$ENV
Service stopped

(Notice type=kafka and name=zoo)
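
The missing check described above can be illustrated with a small sketch. It is not taken from the firecamp source: the `registry` mapping (service name to recorded service type) and the function name are hypothetical stand-ins for wherever the tool keeps service metadata.

```python
from typing import Dict, Optional


def check_service_type(registry: Dict[str, str],
                       service_name: str,
                       requested_type: str) -> Optional[str]:
    """Return an error message if the requested type does not match, else None.

    `registry` is a hypothetical mapping of service name -> recorded type.
    """
    recorded = registry.get(service_name)
    if recorded is None:
        return f"service {service_name!r} not found"
    if recorded != requested_type:
        return (f"service {service_name!r} has type {recorded!r}, "
                f"not {requested_type!r}")
    return None


if __name__ == "__main__":
    # Illustrative data only, mirroring the command in the report.
    services = {"zoo-qa": "zookeeper"}
    print(check_service_type(services, "zoo-qa", "kafka"))
```

With a check of this kind, the stop-service call in the report would be rejected instead of stopping the ZooKeeper service.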

firecamp-service-cli ignores memory parameters

I launched the stack from firecamp.template,
logged in to the bastion host,
and ran firecamp-service-cli with -reserve-memory=768.
The service task definition still has: "memoryReservation": 1024,
and when testing on a t2.micro there isn't 1024 MB left.

Telegraf service creation issues

A couple of issues:

I used the wrong value for -max-memory and got an error, but the service was created anyway:

# ./firecamp-service-cli -region=us-east-1 -cluster=firecamp-prod -op=create-service -service-type=telegraf -service-name=telegraf-cass -tel-monitor-service-name=cass-prod -max-memory=100
2018-03-19 16:48:11.758071617 +0000 UTC create service error InternalError: ClientException: Invalid setting for container 'firecamp-prod-telegraf-cass-container'. 'memory' must be greater than or equal to 'memoryReservation'.
        status code: 400, request id: 4f6dd6d9-2b95-11e8-b3a3-eb68ba90531c

# ./firecamp-service-cli -region=us-east-1 -cluster=firecamp-prod -op=create-service -service-type=telegraf -service-name=telegraf-cass -tel-monitor-service-name=cass-prod
2018-03-19 16:49:31.649930624 +0000 UTC create service error ServiceExist: Service exists

The help output is incorrect for the Telegraf service:

# ./firecamp-service-cli -region=us-east-1 -cluster=firecamp-prod -op=create-service -help
Usage: firecamp-service-cli -op=create-service
  -region string
        The AWS region
  -cluster string
        The cluster name. Can only contain letters, numbers, or hyphens. default: mycluster
  -service-type string
        The catalog service type: mongodb|postgresql|cassandra|zookeeper|kafka|kafkamanager|redis|couchdb|consul|elasticsearch|kibana|logstash|telegraf
  -service-name string
        The service name. Can only contain letters, numbers, or hyphens. The max length is 58
  -max-cpuunits int
        The max number of cpu units for the container
  -reserve-cpuunits int
        The number of cpu units to reserve for the container. default: 256
  -max-memory int
        The max memory for the container, unit: MB
  -reserve-memory int
        The memory reserved for the container, unit: MB. default: 256
  -volume-type string
        The EBS volume type: gp2|io1|st1. default: gp2
  -volume-size int
        The size of each EBS volume, unit: GB
  -volume-iops int
        The EBS volume Iops when io1 type is chosen, otherwise ignored. default: 100
  -volume-encrypted
        whether to create encrypted volume. default: false

It doesn't display the required option -tel-monitor-service-name, yet it shows the volume options, which are definitely out of scope for this service.
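
For the first problem, the ECS error states the rule directly: 'memory' must be greater than or equal to 'memoryReservation'. A minimal sketch of a pre-flight check follows. It assumes the 256 MB default from the help text applies when -reserve-memory is omitted. The function name is hypothetical and not part of firecamp-service-cli.

```python
from typing import Optional

# Default taken from the help text above: -reserve-memory defaults to 256 MB.
# Assumption: the same default applies to the telegraf service type.
DEFAULT_RESERVE_MEMORY_MB = 256


def check_memory_settings(max_memory_mb: Optional[int],
                          reserve_memory_mb: int = DEFAULT_RESERVE_MEMORY_MB
                          ) -> Optional[str]:
    """Return an error message when the settings are inconsistent, else None.

    max_memory_mb is None when -max-memory was not given (no hard limit).
    """
    if max_memory_mb is None:
        return None
    if max_memory_mb < reserve_memory_mb:
        return (f"max-memory ({max_memory_mb} MB) must be greater than or "
                f"equal to reserve-memory ({reserve_memory_mb} MB)")
    return None


if __name__ == "__main__":
    # The failing command above used -max-memory=100 with no -reserve-memory.
    print(check_memory_settings(100))
```

Running a check like this before any resource is created would avoid the state shown above, where the second attempt fails with ServiceExist.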

Cassandra/Kafka usage question

Hello. I'm not sure this is a good place to ask questions, but you might move this Q/A to the wiki section if you find it relevant.
What is the best way to connect to services like Cassandra/Kafka from other applications (Java code, for example)? I need to give my developers the IP addresses of the service endpoints, but since the EC2 instances might be terminated and replaced by new ones at any time, those IP addresses will change and an application will lose access to the service.
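
Other reports on this page address service members by DNS name rather than by IP, for example kafka-uat-0.firecamp-uat-firecamp.com:9092. The sketch below builds such a list, assuming the pattern `<service>-<index>.<cluster>-firecamp.com` inferred from those log lines. It also assumes the DNS records keep following a member when its instance is replaced, which this page does not confirm. The function names are hypothetical.

```python
from typing import List


def member_dns_names(service: str, cluster: str,
                     replicas: int, port: int) -> List[str]:
    """Build host:port strings following the naming seen in the logs here.

    Assumed pattern: <service>-<index>.<cluster>-firecamp.com, index from 0.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [f"{service}-{i}.{cluster}-firecamp.com:{port}"
            for i in range(replicas)]


def bootstrap_servers(service: str, cluster: str,
                      replicas: int, port: int = 9092) -> str:
    """Join the member names into a comma-separated broker list."""
    return ",".join(member_dns_names(service, cluster, replicas, port))


if __name__ == "__main__":
    print(bootstrap_servers("kafka-uat", "firecamp-uat", 3))
```

An application configured with names like these would not need to be updated when an instance's IP address changes, provided the assumption about the DNS records holds.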

Proper way to restart services

Hello. This might be a dumb question, and I apologize. What is the correct way to restart a service (in terms of the cluster)? For example, how do I restart Kafka containers safely?

firecamp-autoscalegroup.template does not use AMI ID specified in the parent template (firecamp-existingvpc.template)

Please, fix

Kafka issue during a worker instance outage

Hello,

Some time ago an alert came from our monitoring system showing that the Kafka service was not available. I looked at the EC2 console and found that one of the 3 firecamp brokers had an alarm for Instance Status Checks. I'm wondering why that made the Kafka service completely inaccessible.

Here is how Kafka is checked from the monitoring host:

# /bin/docker run --rm harisekhon/cassandra-dev check_kafka.pl -B kafka-uat-0.firecamp-uat-firecamp.com:9092,kafka-uat-1.firecamp-uat-firecamp.com:9092,kafka-uat-2.firecamp-uat-firecamp.com:9092 -T testtopic -vvv
verbose mode on

check_kafka.pl version 0.3  =>  Hari Sekhon Utils version 1.18.9

broker host:              kafka-uat-0.firecamp-uat-firecamp.com
broker port:              9092
broker host:              kafka-uat-1.firecamp-uat-firecamp.com
broker port:              9092
broker host:              kafka-uat-2.firecamp-uat-firecamp.com
broker port:              9092
host:                     kafka-uat-0.firecamp-uat-firecamp.com
port:                     9092
topic:                    testtopic
required acks:            1
send-max-attempts:        1
receive-max-attempts:     1
retry-backoff:            200
sleep:                    0.5

setting timeout to 10 secs

connecting to Kafka brokers kafka-uat-0.firecamp-uat-firecamp.com:9092,kafka-uat-1.firecamp-uat-firecamp.com:9092,kafka-uat-2.firecamp-uat-firecamp.com:9092
CRITICAL: Error: Cannot get metadata: topic='<undef>'

Trace begun at /usr/local/share/perl5/site_perl/Kafka/Connection.pm line 1592
Kafka::Connection::_error('Kafka::Connection=HASH(0x55caf194e5a0)', -1007, 'topic=\'<undef>\'') called at /usr/local/share/perl5/site_perl/Kafka/Connection.pm line 693
Kafka::Connection::get_metadata('Kafka::Connection=HASH(0x55caf194e5a0)') called at /github/nagios-plugins/check_kafka.pl line 257
main::__ANON__ at /github/nagios-plugins/lib/HariSekhonUtils.pm line 565
eval {...} at /github/nagios-plugins/lib/HariSekhonUtils.pm line 565
HariSekhonUtils::try('CODE(0x55caf19559d8)') called at /github/nagios-plugins/check_kafka.pl line 383

ZooKeeper and Kafka logs attached: kafka-uat.log.gz, zoo-uat.log.gz

After some time everything returned to a working state, but the Kafka service was unavailable for about 15 minutes.

Please, take a look and let me know if you need anything else.

Tasks can't be started on an EC2 instance

There was an issue with one of the EC2 instances, so I terminated it and the ASG started a new one. For some reason, the containers are not starting up on the new instance. They fail with (from /var/log/docker):

time="2018-03-02T10:07:55Z" level=info msg="2018/03/02 10:07:55 http: panic serving @: runtime error: invalid memory address or nil pointer dereference" plugin=3c95129f659d2d162550065c4200980c15d8d2ce25c002c9f01f96c84f3ea636
time="2018-03-02T10:07:55.565099089Z" level=warning msg="Unable to connect to plugin: /run/docker/plugins/3c95129f659d2d162550065c4200980c15d8d2ce25c002c9f01f96c84f3ea636/firecampvol.sock/VolumeDriver.Mount: Post http://%2Frun%2Fdocker%2Fplugins%2F3c95129f659d2d162550065c4200980c15d8d2ce25c002c9f01f96c84f3ea636%2Ffirecampvol.sock/VolumeDriver.Mount: EOF, retrying in 1s"

So, no volumes are mounted.

[root@ip-172-22-2-212 log]# docker ps
CONTAINER ID        IMAGE                                        COMMAND             CREATED             STATUS              PORTS               NAMES
634ef2c0ea2f        cloudstax/firecamp-amazon-ecs-agent:latest   "/agent"            5 hours ago         Up 5 hours                              ecs-agent
[root@ip-172-22-2-212 log]# docker plugin ls
ID                  NAME                               DESCRIPTION                                     ENABLED
3c95129f659d        cloudstax/firecamp-volume:latest   firecamp volume plugin for docker               true
656134559eb0        cloudstax/firecamp-log:latest      firecamp log plugin for docker: consume lo...   true

The only thing I did to this cluster recently was kill the firecamp-manageserver task so that it would be updated to the latest version.
The other two cluster nodes work without issues.
The only difference I see is the agent version:

grep Agent /var/log/firecamp/firecamp-dockervolume.INFO

"Amazon ECS Agent - v1.16.2 (*55b7b5f)" - at the new (non-working) instance
"Amazon ECS Agent - v1.16.0 (*e24ae08)" - at working instances

Please, help me to figure that out!

0.9.2 release fails to launch in AWS

I used the firecamp-existingvpc.template template. At some point CloudFormation stack creation fails at the ASG initialization. I was able to find what's going on:

[root@ip-172-31-38-80 tmp]# docker plugin install --grant-all-permissions cloudstax/firecamp-log:0.9.2 CLUSTER="firecamp-prod"
0.9.2: Pulling from cloudstax/firecamp-log
481765f73fa2: Download complete
Digest: sha256:fb14fb10d55f7e78d65b1e874ee036844a0f5552074c9c94b1715695286e723a
Status: Downloaded newer image for cloudstax/firecamp-log:0.9.2
Error response from daemon: setting "CLUSTER" not found in the plugin configuration

The latest release installs w/o issues:

[root@ip-172-31-38-80 tmp]# docker plugin install --grant-all-permissions cloudstax/firecamp-log:latest CLUSTER="firecamp-prod"
latest: Pulling from cloudstax/firecamp-log
1c827c905aed: Download complete
Digest: sha256:809700ced49e4b477f5ff42f9a6ec4bdb3e649982bf91913f3f865afad932a2d
Status: Downloaded newer image for cloudstax/firecamp-log:latest
Installed plugin cloudstax/firecamp-log:latest

Please, fix

Parameters Allowed Pattern bug

The allowed pattern for the VPC and subnet parameters is set to
subnet-[0-9a-z]{8}
which is not correct, since the IDs may be longer than 8 characters.
Please change it to subnet-[0-9a-z]{8,}
or anything else that would work.
The template can't be run as it is now.
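
The difference between the two patterns can be checked with a short sketch. It assumes the parameter check requires the whole value to match, which is modelled here with `re.fullmatch`. Both subnet IDs are made up for illustration.

```python
import re

# Pattern quoted in the report: exactly 8 characters after "subnet-".
OLD_PATTERN = r"subnet-[0-9a-z]{8}"
# Pattern proposed in the report: 8 or more characters.
NEW_PATTERN = r"subnet-[0-9a-z]{8,}"

# Made-up IDs for illustration only.
SHORT_ID = "subnet-0a1b2c3d"           # 8 characters after the prefix
LONG_ID = "subnet-0a1b2c3d4e5f6a7b8"   # 17 characters after the prefix


def is_allowed(pattern: str, value: str) -> bool:
    """Return True when the entire value matches the pattern.

    Assumption: the parameter validation is a whole-string match.
    """
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    for value in (SHORT_ID, LONG_ID):
        print(value,
              "old:", is_allowed(OLD_PATTERN, value),
              "new:", is_allowed(NEW_PATTERN, value))
```

Under that assumption the old pattern rejects the longer ID, while the proposed pattern accepts both forms.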

Having trouble building firecamp-service-cli 0.9.0 from source

The S3 link to the CLI in master's README.md points to CLI version 0.8.0; however, the tutorial on setting up a Cassandra cluster uses a flag that is only supported in the 0.9.0 release: -journal-volume-size

firecamp-service-cli -op=create-service -service-type=cassandra -region=us-east-1 -cluster=t1 -service-name=mycas -replicas=3 -volume-size=100 -journal-volume-size=10

I've been trying to build the CLI from source on my local machine yet have been seeing issues with the aws/session package:

root@8d9d88564e5c:/usr/src/firecamp-service-cli# go build -v
_/usr/src/firecamp-service-cli
# _/usr/src/firecamp-service-cli
./main.go:1182:36: cannot use sess (type *"github.com/aws/aws-sdk-go/aws/session".Session) as type *"github.com/cloudstax/firecamp/vendor/github.com/aws/aws-sdk-go/aws/session".Session in argument to awsroute53.NewAWSRoute53

@JuniusLuo can you help with providing the latest cli executable or let me know how I'd fix the issue above? I came across Firecamp last night and this is a major roadblock. Thanks.

Cassandra replica restoration

After some manipulations it appears volumes of a Cassandra replica were accidentally deleted. At least, I see the following in the firecamp logs (/var/log/firecamp/firecamp-dockervolume.ERROR) on one of EC2 instances:

E1212 12:36:32.039874      13 volume.go:851] detach journal volume from last owner error NotFound requuid 172.22.5.224-bda1319c0a71481456f7689bb2b61571-1513082191 {vol-00fd036927e65754a /dev/xvdj vol-0d1b18609d7b32e1c /dev/xvdk} &{bda1319c0a71481456f7689bb2b61571 2 cass-qa-2 us-east-1c arn:aws:ecs:us-east-1:ID:task/e36d526c-1007-4cf4-a3ca-ff962674c632 arn:aws:ecs:us-east-1:ID:container-instance/f822e87a-47c1-4a68-a8e8-9ccbe23e9009 i-0bd3125d1e463d369 1513070833991537870 {vol-00fd036927e65754a /dev/xvdj vol-0d1b18609d7b32e1c /dev/xvdk} 127.0.0.1 [0xc4202eaa20 0xc4202eaa80 0xc4202eaab0 0xc4202eab10]}
E1212 12:36:32.039896      13 volume.go:729] Mount failed, get service member error NotFound, serviceUUID bda1319c0a71481456f7689bb2b61571, requuid 172.22.5.224-bda1319c0a71481456f7689bb2b61571-1513082191
E1212 12:36:43.873859      13 ec2.go:222] failed to DescribeVolumes vol-0d1b18609d7b32e1c error InvalidVolume.NotFound: The volume 'vol-0d1b18609d7b32e1c' does not exist.
        status code: 400, request id: a3acc2b9-f47a-4ec2-8364-74b627cc89c0 requuid 172.22.5.224-bda1319c0a71481456f7689bb2b61571-1513082203
E1212 12:36:43.873876      13 ec2.go:177] GetVolumeInfo vol-0d1b18609d7b32e1c error InvalidVolume.NotFound: The volume 'vol-0d1b18609d7b32e1c' does not exist.
        status code: 400, request id: a3acc2b9-f47a-4ec2-8364-74b627cc89c0 requuid 172.22.5.224-bda1319c0a71481456f7689bb2b61571-1513082203
E1212 12:36:43.873884      13 ec2.go:162] GetVolumeState vol-0d1b18609d7b32e1c error InvalidVolume.NotFound: The volume 'vol-0d1b18609d7b32e1c' does not exist.
        status code: 400, request id: a3acc2b9-f47a-4ec2-8364-74b627cc89c0 requuid 172.22.5.224-bda1319c0a71481456f7689bb2b61571-1513082203
E1212 12:36:43.873893      13 volume.go:1227] GetVolumeState error NotFound volume vol-0d1b18609d7b32e1c ServerInstanceID i-0bd3125d1e463d369 device /dev/xvdk requuid 172.22.5.224-bda1319c0a71481456f7689bb2b61571-1513082203

This leads to the following task event:

Status reason | CannotStartContainerError:  API error (500): error while mounting volume  '/var/lib/docker/plugins/4f11459ccd04e2f94009d96f631266758d8c3bc4fb120e1f9376a9bd568c1792/rootfs':  VolumeDriver.Mount: Mount failed, get service member error NotFound,  serviceUUID bda

Is there a way to re-create the failed replica without re-launching the whole Cassandra service from scratch?

Cassandra: vm.max_map_count

WARN [main] 2018-02-13 10:16:44,919 StartupChecks.java:271 - Maximum number of memory map areas per process (vm.max_map_count) 262144 is too low, recommended value: 1048575, you can change it with sysctl.

Is it OK to change the default to 1048575?
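
To see what a host currently uses, the value can be read from procfs and compared with the figure in the warning. This is a minimal sketch: the recommended value is the one quoted in the Cassandra log line, `/proc/sys/vm/max_map_count` is the standard Linux location of the setting, and the function names are illustrative.

```python
from pathlib import Path

# Recommended value quoted in the Cassandra warning above.
RECOMMENDED_MAX_MAP_COUNT = 1048575
# Standard Linux procfs location of the setting.
PROC_PATH = Path("/proc/sys/vm/max_map_count")


def parse_max_map_count(text: str) -> int:
    """Parse the contents of the procfs file into an integer."""
    return int(text.strip())


def is_too_low(current: int,
               recommended: int = RECOMMENDED_MAX_MAP_COUNT) -> bool:
    """Return True when the current value is below the recommended one."""
    return current < recommended


if __name__ == "__main__":
    if PROC_PATH.exists():
        current = parse_max_map_count(PROC_PATH.read_text())
        state = "too low" if is_too_low(current) else "ok"
        print(f"vm.max_map_count = {current} ({state})")
    else:
        print("vm.max_map_count is not available on this system")
```

As the warning says, the value itself is changed with sysctl on the host, for example `sysctl -w vm.max_map_count=1048575` run as root. Whether firecamp should set this by default is the open question of this issue.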

unable to build zookeeper docker container

+ target=firecamp-zookeeper
+ image=mydocker/firecamp-zookeeper:3.4
+ path=/home/user/go/src/github.com/cloudstax/firecamp/catalog/zookeeper/3.4/dockerfile/
+ docker build -q -t mydocker/firecamp-zookeeper:3.4 /home/user/go/src/github.com/cloudstax/firecamp/catalog/zookeeper/3.4/dockerfile/
Sending build context to Docker daemon  12.29kB
Step 1/12 : FROM openjdk:8-jre-alpine
8-jre-alpine: Pulling from library/openjdk
ff3a5c916c92: Pull complete
5de5f69f42d7: Pull complete
fa7536dd895a: Pull complete
Digest: sha256:d3468b0fab294db03b4a67cabdaccf9c47a635ad14429ad43a0cce522e1ca8b3
Status: Downloaded newer image for openjdk:8-jre-alpine
 ---> b1bd879ca9b3
Step 2/12 : RUN apk add --no-cache   bash   su-exec
 ---> Running in ecf20e16270f
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/x86_64/APKINDEX.tar.gz
(1/7) Installing pkgconf (1.3.10-r0)
(2/7) Installing ncurses-terminfo-base (6.0_p20171125-r0)
(3/7) Installing ncurses-terminfo (6.0_p20171125-r0)
(4/7) Installing ncurses-libs (6.0_p20171125-r0)
(5/7) Installing readline (7.0.003-r0)
(6/7) Installing bash (4.4.12-r2)
Executing bash-4.4.12-r2.post-install
(7/7) Installing su-exec (0.2-r0)
Executing busybox-1.27.2-r7.trigger
OK: 90 MiB in 57 packages
Removing intermediate container ecf20e16270f
 ---> 65d417967c5c
Step 3/12 : ENV ZOO_USER=zookeeper
 ---> Running in 00c0ddf63307
Removing intermediate container 00c0ddf63307
 ---> d9897087278f
Step 4/12 : RUN set -x   && adduser -D "$ZOO_USER"
 ---> Running in 3fccfad7a6e0
+ adduser -D zookeeper
Removing intermediate container 3fccfad7a6e0
 ---> 56fb9f7d254e
Step 5/12 : ARG GPG_KEY=C823E3E5B12AF29C67F81976F5CECB3CB5E9BD2D
 ---> Running in 1f4fcb41f88f
Removing intermediate container 1f4fcb41f88f
 ---> 0bd92d063a3b
Step 6/12 : ARG DISTRO_NAME=zookeeper-3.4.10
 ---> Running in 33b3fe94ede7
Removing intermediate container 33b3fe94ede7
 ---> 325071bad640
Step 7/12 : RUN set -x   && apk add --no-cache --virtual .build-deps      gnupg   && wget -q "http://www.apache.org/dist/zookeeper/$DISTRO_NAME/$DISTRO_NAME.tar.gz"   && wget -q "http://www.apache.org/dist/zookeeper/$DISTRO_NAME/$DISTRO_NAME.tar.gz.asc"   && export GNUPGHOME="$(mktemp -d)"   && gpg --keyserver ha.pool.sks-keyservers.net --recv-key "$GPG_KEY"   && gpg --batch --verify "$DISTRO_NAME.tar.gz.asc" "$DISTRO_NAME.tar.gz"   && tar -xzf "$DISTRO_NAME.tar.gz"   && rm -r "$GNUPGHOME" "$DISTRO_NAME.tar.gz" "$DISTRO_NAME.tar.gz.asc"   && apk del .build-deps
 ---> Running in cc737c771109
+ apk add --no-cache --virtual .build-deps gnupg
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/x86_64/APKINDEX.tar.gz
(1/16) Installing libgpg-error (1.27-r1)
(2/16) Installing libassuan (2.4.4-r0)
(3/16) Installing libcap (2.25-r1)
(4/16) Installing pinentry (1.0.0-r0)
Executing pinentry-1.0.0-r0.post-install
(5/16) Installing libgcrypt (1.8.1-r0)
(6/16) Installing gmp (6.1.2-r1)
(7/16) Installing nettle (3.3-r0)
(8/16) Installing libunistring (0.9.7-r0)
(9/16) Installing gnutls (3.6.1-r0)
(10/16) Installing libksba (1.3.5-r0)
(11/16) Installing db (5.3.28-r0)
(12/16) Installing libsasl (2.1.26-r11)
(13/16) Installing libldap (2.4.45-r3)
(14/16) Installing npth (1.5-r1)
(15/16) Installing gnupg (2.2.3-r0)
(16/16) Installing .build-deps (0)
Executing busybox-1.27.2-r7.trigger
OK: 102 MiB in 73 packages
+ wget -q http://www.apache.org/dist/zookeeper/zookeeper-3.4.10/zookeeper-3.4.10.tar.gz
+ wget -q http://www.apache.org/dist/zookeeper/zookeeper-3.4.10/zookeeper-3.4.10.tar.gz.asc
+ mktemp -d
+ export GNUPGHOME=/tmp/tmp.JaNImj
+ gpg --keyserver ha.pool.sks-keyservers.net --recv-key C823E3E5B12AF29C67F81976F5CECB3CB5E9BD2D
gpg: keybox '/tmp/tmp.JaNImj/pubring.kbx' created
gpg: /tmp/tmp.JaNImj/trustdb.gpg: trustdb created
gpg: key F5CECB3CB5E9BD2D: public key "Rakesh Radhakrishnan (CODE SIGNING KEY) <[email protected]>" imported
gpg: Total number processed: 1
gpg:               imported: 1
+ gpg --batch --verify zookeeper-3.4.10.tar.gz.asc zookeeper-3.4.10.tar.gz
gpg: Signature made Thu Mar 23 11:45:03 2017 UTC
gpg:                using RSA key F5CECB3CB5E9BD2D
gpg: BAD signature from "Rakesh Radhakrishnan (CODE SIGNING KEY) <[email protected]>" [unknown]
The command '/bin/sh -c set -x   && apk add --no-cache --virtual .build-deps      gnupg   && wget -q "http://www.apache.org/dist/zookeeper/$DISTRO_NAME/$DISTRO_NAME.tar.gz"   && wget -q "http://www.apache.org/dist/zookeeper/$DISTRO_NAME/$DISTRO_NAME.tar.gz.asc"   && export GNUPGHOME="$(mktemp -d)"   && gpg --keyserver ha.pool.sks-keyservers.net --recv-key "$GPG_KEY"   && gpg --batch --verify "$DISTRO_NAME.tar.gz.asc" "$DISTRO_NAME.tar.gz"   && tar -xzf "$DISTRO_NAME.tar.gz"   && rm -r "$GNUPGHOME" "$DISTRO_NAME.tar.gz" "$DISTRO_NAME.tar.gz.asc"   && apk del .build-deps' returned a non-zero code: 1
make: *** [docker] Error 1

Mount Failed

Hi There,

I'm trying to spin up the zookeeper service with three replicas, but the service only deploys two; the third throws errors because it cannot mount the volume. I've confirmed the EBS volume was created and is available. I deleted the service and terminated the bad node, then redeployed the zookeeper service once the ASG had spun up a new node. The same error happened.

Please let me know if there's any more info I can provide to help identify where the issue is happening and whether it's something I need to change on my end. I'm using the normal CloudFormation template in AWS with three nodes, one in each of my three defined availability zones.

Thank you

OS: Linux ip-10-0-43-217 4.9.81-35.56.amzn1.x86_64 #1 SMP Fri Feb 16 00:18:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Firecamp volume error log

E0315 18:24:34.022496       6 volume.go:592] findIdleMember error InternalError requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273 service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} }

E0315 18:24:34.022513       6 volume.go:546] Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273

ecs-agent error

2018-03-15T18:29:55Z [INFO] TaskHandler: batching container event: arn:aws:ecs:us-east-1:050772179124:task/4e24694b-02f9-46fe-9714-e4cce7f7a900 firecamp-stage-firecamp-stage-zookeeper-container -> STOPPED, Reason CannotStartContainerError: API error (500): error while mounting volume '/var/lib/docker/plugins/0bb436c154f10d5a0318180d992dfaf0f66dec1cbd8e1d83a8fb1888e8e3ccf1/rootfs': VolumeDriver.Mount: Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138595
, Known Sent: NONE

2018-03-15T18:29:55Z [INFO] TaskHandler: Adding event: TaskChange: [arn:aws:ecs:us-east-1:050772179124:task/4e24694b-02f9-46fe-9714-e4cce7f7a900 -> STOPPED, Known Sent: NONE, PullStartedAt: 2018-03-15 18:29:55.284603066 +0000 UTC, PullStoppedAt: 2018-03-15 18:29:55.39755933 +0000 UTC, ExecutionStoppedAt: 2018-03-15 18:29:55.604614019 +0000 UTC, arn:aws:ecs:us-east-1:050772179124:task/4e24694b-02f9-46fe-9714-e4cce7f7a900 firecamp-stage-firecamp-stage-zookeeper-container -> STOPPED, Reason CannotStartContainerError: API error (500): error while mounting volume '/var/lib/docker/plugins/0bb436c154f10d5a0318180d992dfaf0f66dec1cbd8e1d83a8fb1888e8e3ccf1/rootfs': VolumeDriver.Mount: Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138595

create cassandra service error EOF

# ./firecamp-service-cli -op=create-service -service-type=cassandra -region=us-east-1 -cluster=test-fc -service-name=cass-test-fc -replicas=1 -volume-size=10 -journal-volume-size=1
create cassandra service error EOF

The Cassandra service still starts without issues, though. There are no such issues with ZooKeeper.
The CLI and manageserver are the latest versions.

worker volumes are not tagged

Please, fix

Long time to launch tasks after an EC2 instance failure

An EC2 instance failed (an AWS issue), so firecamp's ASG terminated it and launched a replacement. After the new instance came up, no tasks were able to start. The error in the AWS ECS console is:

Status reason | CannotStartContainerError:  API error (500): error while mounting volume  '/var/lib/docker/plugins/2ec1ac405b2314e7a06c414ab0323a74187b49f1b9e9d7dcefb670bff13f599d/rootfs':  VolumeDriver.Mount: Mount failed, get service member error Timeout,  serviceUUID d5cc

Fortunately, after some time (~37 minutes) the tasks started without any intervention from my side.

I'm not sure, but the cause might be the long time the failed instance spent in the shutting-down state.

unable to build kafkamanager docker image

It fails with:

$ pwd
/home/user/go/src/github.com/cloudstax/firecamp/catalog/kafkamanager/1.3.3/dockerfile
$ docker build -t 111111111111.dkr.ecr.us-east-1.amazonaws.com/firecamp-kafka-manager:1.3.3-1.0 .
Sending build context to Docker daemon  11.57MB
Step 1/10 : FROM debian:jessie-backports
 ---> 3c66f9166174
...
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::              FAILED DOWNLOADS            ::
[warn]  :: ^ see resolution messages for details  ^ ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.xerial.snappy#snappy-java;1.1.7.1!snappy-java.jar
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[info] Wrote /tmp/kafka-manager/target/scala-2.11/kafka-manager_2.11-1.3.3.18.pom
sbt.ResolveException: download failed: org.xerial.snappy#snappy-java;1.1.7.1!snappy-java.jar'
...

Every time, it fails to download a different file.

No jmx user/password set during Kafka service creation

$ PREFIX=0.9.6
$ $PREFIX/firecamp-service-cli -op=create-service -service-type=kafka -region=us-east-1 -cluster=firecamp-$ENV -replicas=3 -volume-size=$VOLSZ -service-name=kafka-$ENV -kafka-zk-service=zoo-$ENV -kafka-heap-size=512
The Kafka heap size is less than 6144. Please increase it for production system
2018-05-18 11:00:12.399656608 +0000 UTC The kafka service is created, jmx user  password
...

make fails

I was trying to build firecamp from source and got this:

$ make
./scripts/install.sh
+ protoc -I db/controldb/protocols/ db/controldb/protocols/controldb.proto --go_out=plugins=grpc:db/controldb/protocols
+ cd syssvc/firecamp-controldb
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd syssvc/firecamp-dockervolume
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd syssvc/firecamp-dockerlog
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd syssvc/firecamp-manageserver
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd syssvc/firecamp-service-cli
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd syssvc/firecamp-swarminit
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd /home/user/go/bin
+ tar -zcf firecamp-service-cli.tgz firecamp-service-cli
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd /home/user/go/bin
+ tar -zcf firecamp-swarminit.tgz firecamp-swarminit
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd containersvc/k8s/firecamp-initcontainer/
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd containersvc/k8s/firecamp-stopcontainer/
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd syssvc/tools/firecamp-volume-replace
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd /home/user/go/bin
+ tar -zcf firecamp-volume-replace.tgz firecamp-volume-replace
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd syssvc/examples/firecamp-cleanup
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
+ cd syssvc/examples/firecamp-service-creation-example
+ go install
+ cd -
/home/user/go/src/github.com/cloudstax/firecamp
./scripts/builddocker.sh latest all
+ set -e
++ pwd
+ export TOPWD=/home/user/go/src/github.com/cloudstax/firecamp
+ TOPWD=/home/user/go/src/github.com/cloudstax/firecamp
+ version=latest
+ buildtarget=all
+ org=cloudstax/
+ system=firecamp
+ '[' all = all ']'
+ BuildPlugin
+ path=/home/user/go/src/github.com/cloudstax/firecamp/scripts/plugin-dockerfile
+ target=firecamp-pluginbuild
+ image=cloudstax/firecamp-pluginbuild
+ echo '### docker build: builder image'
### docker build: builder image
+ docker build -q -t cloudstax/firecamp-pluginbuild /home/user/go/src/github.com/cloudstax/firecamp/scripts/plugin-dockerfile
sha256:3012744b0ef7ec940803657b6e665b201f2c01395bf3d76af7248ca8cd25aca2
+ echo '### docker run: builder image with source code dir mounted'
### docker run: builder image with source code dir mounted
+ containername=firecamp-buildtest
+ docker rm firecamp-buildtest
Error: No such container: firecamp-buildtest
+ true
+ docker run --name firecamp-buildtest -v /home/user/go/src/github.com/cloudstax/firecamp:/go/src/github.com/cloudstax/firecamp cloudstax/firecamp-pluginbuild
total 4
drwxr-xr-x    1 root     root            22 Jan 30 10:22 .
drwxr-xr-x    1 root     root            23 Jan 30 10:21 ..
drwxr-xr-x   19 556      500           4096 Jan 30 10:19 firecamp
build firecamp-dockervolume
build firecamp-dockerlog
firecamp-dockerlog
firecamp-dockervolume
+ volumePluginPath=/home/user/go/src/github.com/cloudstax/firecamp/syssvc/firecamp-dockervolume/dockerfile
+ volumePluginImage=cloudstax/firecamp-volume
+ echo '### docker build: rootfs image with firecamp-dockervolume'
### docker build: rootfs image with firecamp-dockervolume
+ docker cp firecamp-buildtest:/go/bin/firecamp-dockervolume /home/user/go/src/github.com/cloudstax/firecamp/syssvc/firecamp-dockervolume/dockerfile
+ docker build -q -t cloudstax/firecamp-volume:rootfs /home/user/go/src/github.com/cloudstax/firecamp/syssvc/firecamp-dockervolume/dockerfile
sha256:36cd3e91de833750a4c2c7174e32adee196625134ae16c1a73b390b99a036be0
+ rm -f /home/user/go/src/github.com/cloudstax/firecamp/syssvc/firecamp-dockervolume/dockerfile/firecamp-dockervolume
+ echo '### create the plugin rootfs directory'
### create the plugin rootfs directory
+ volumePluginBuildPath=/home/user/go/src/github.com/cloudstax/firecamp/build/volumeplugin
+ mkdir -p /home/user/go/src/github.com/cloudstax/firecamp/build/volumeplugin/rootfs
+ docker rm -vf tmp
Error: No such container: tmp
+ true
+ docker create --name tmp cloudstax/firecamp-volume:rootfs
a9f683af85aa3d35c5fdd703c8b4fb463ce693a596c367a205f762943ee5752d
+ tar -x -C /home/user/go/src/github.com/cloudstax/firecamp/build/volumeplugin/rootfs
+ docker export tmp
+ cp /home/user/go/src/github.com/cloudstax/firecamp/syssvc/firecamp-dockervolume/config.json /home/user/go/src/github.com/cloudstax/firecamp/build/volumeplugin
+ docker rm -vf tmp
tmp
+ echo '### create new plugin cloudstax/firecamp-volume:latest'
### create new plugin cloudstax/firecamp-volume:latest
+ docker plugin rm -f cloudstax/firecamp-volume:latest
Error: No such plugin: cloudstax/firecamp-volume:latest
+ true
+ docker plugin create cloudstax/firecamp-volume:latest /home/user/go/src/github.com/cloudstax/firecamp/build/volumeplugin
cloudstax/firecamp-volume:latest
+ docker plugin push cloudstax/firecamp-volume:latest
The push refers to repository [docker.io/cloudstax/firecamp-volume]
01ca22324601: Preparing
denied: requested access to the resource is denied
make: *** [docker] Error 1

Please, advise!

auto-limit java heap size

Let's add these options to the JVM settings for all services:

-XX:+PrintCommandLineFlags -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap

That way we wouldn't need to limit the Java heap size by hand; the JVM would size the heap automatically from the container's memory limit.

BUILD_YOUR_IMAGES wrong instructions

1. Build your own ecs agent docker image.

Checkout cloudstax amazon-ecs-agent branch, git clone https://github.com/cloudstax/amazon-ecs-agent.git, and change the "org" in Makefile and agent/engine/firecamp_task_engine.go to "mydockeraccount/". Then simply 'make' to build and upload the docker image.

There's no "org" defined at all:

$ pwd
/home/user/go/src/github.com/cloudstax/amazon-ecs-agent

$ grep org Makefile agent/engine/firecamp_task_engine.go
Makefile:       go get golang.org/x/tools/cmd/cover
Makefile:       go get golang.org/x/tools/cmd/goimports

$ grep cloudstax -r *
agent/engine/firecamp_task_engine.go:// Define here again to avoid the dependency on githut.com/cloudstax/firecamp
agent/engine/firecamp_task_engine.go:   volumeDriver        = "cloudstax/firecamp-volume"
agent/engine/firecamp_task_engine.go:   logDriver           = "cloudstax/firecamp-log"
Makefile:       @docker build -f scripts/dockerfiles/Dockerfile.release -t "cloudstax/firecamp-amazon-ecs-agent:latest" .
Makefile:       @echo "Built Docker image \"cloudstax/firecamp-amazon-ecs-agent:latest\""

firecamp-volume-replace is missing for non-latest releases

$ wget https://s3.amazonaws.com/cloudstax/firecamp/releases/0.9.6/packages/firecamp-volume-replace.tgz
--2018-05-17 16:36:13--  https://s3.amazonaws.com/cloudstax/firecamp/releases/0.9.6/packages/firecamp-volume-replace.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.134.53
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.134.53|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2018-05-17 16:36:13 ERROR 403: Forbidden.

Please, upload!

make docker fails for firecamp-cassandra-init

firecamp-cassandra-init.docker-build.log.gz

AWS Quickstart Failing

I am likely missing something, but the stack build fails with this error:

	Embedded stack arn:aws:cloudformation:xxxx/CloudStax-FireCamp-VPCStack-xxxx/xxxxxx was not successfully created: The following resource(s) failed to create: [NATGateway3, NATGateway2, NATGateway1].

Info

AWS, CloudFormation template for a new VPC.
I left all the defaults except:
- cluster name
- SSH allowed CIDR block
- availability zones (still three)
- region
- m3.large instance type (to ensure m4 doesn't cause an issue)
We aren't seeing any warnings about our API limits from AWS.

I am evaluating this for potential use, but I am not able to get past the CloudFormation stack creation. If this isn't a quick fix, is there a better way to install your software on Linux?

thank you

Cassandra backup/restore

I would like to discuss ideas on the best way to implement this. I thought about integrating Netflix's Priam, but it doesn't seem to work as a backup/restore-only solution.
Another cool tool is https://github.com/pearsontechnology/cassandra_snap. However, it needs SSH access to each instance and requires listing all the nodes to back up, rather than discovering them automatically.
What are your thoughts?

get-service panics

firecamp-service-cli version is 0.9.2

# ./firecamp-service-cli -cluster=firecamp-qa -region=us-east-1 -service-name=kafka-qa -op=get-service
{ServiceUUID:ae9a07f638c145866458232d81edbead ServiceStatus:ACTIVE LastModified:1513076088004634004 Replicas:3 ClusterName:firecamp-qa ServiceName:kafka-qa Volumes:{PrimaryDeviceName:/dev/xvdm PrimaryVolume:{VolumeType:gp2 VolumeSizeGB:100 Iops:100} JournalDeviceName: JournalVolume:{VolumeType: VolumeSizeGB:0 Iops:0}} RegisterDNS:true DomainName:firecamp-qa-firecamp.com HostedZoneID:/hostedzone/Z1OA04B9KUSH29 RequireStaticIP:false UserAttr:<nil> Resource:{MaxCPUUnits:0 ReserveCPUUnits:0 MaxMemMB:0 ReserveMemMB:0}}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6a8efe]

goroutine 1 [running]:
main.getService(0x7f36078f80e0, 0xc420014410, 0xc4200fbda0)
        /home/junius/work/go/src/github.com/cloudstax/firecamp/syssvc/firecamp-service-cli/main.go:1472 +0x28e
main.main()
        /home/junius/work/go/src/github.com/cloudstax/firecamp/syssvc/firecamp-service-cli/main.go:526 +0xc04
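
The output above prints `UserAttr:<nil>` right before the panic, so one plausible cause (a guess, since I have not checked line 1472 of main.go) is that the CLI dereferences the user attributes without a nil check. A minimal Go sketch of the guard, using simplified hypothetical types rather than the real firecamp structs:

```go
package main

import "fmt"

// UserAttr and ServiceAttr are simplified, hypothetical stand-ins for the
// real firecamp types; only the fields needed for this illustration exist.
type UserAttr struct {
	ServiceType string
}

type ServiceAttr struct {
	ServiceName string
	UserAttr    *UserAttr // nil in the get-service output above
}

// describeUserAttr checks both pointers before dereferencing them, so a
// service without user attributes yields a message instead of a SIGSEGV.
func describeUserAttr(attr *ServiceAttr) string {
	if attr == nil || attr.UserAttr == nil {
		return "no user attributes"
	}
	return "service type: " + attr.UserAttr.ServiceType
}

func main() {
	// A service created without user attributes.
	fmt.Println(describeUserAttr(&ServiceAttr{ServiceName: "kafka-qa"}))

	// A service that has them.
	fmt.Println(describeUserAttr(&ServiceAttr{
		ServiceName: "kafka-qa",
		UserAttr:    &UserAttr{ServiceType: "kafka"},
	}))
}
```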

aws quickstart template misses latest version

https://s3.amazonaws.com/quickstart-reference/cloudstax/firecamp/latest/templates/firecamp-existingvpc.template:

Deploy FireCamp into existing VPC

Hello,

Is it possible to deploy a FireCamp cluster into an existing VPC?

Thanks
Cal

Kafkamanager does not come up

I created a new firecamp cluster from scratch (using firecamp.template) and tried to start the kafkamanager service:

./firecamp-service-cli -cluster=firecamp-qa -region=us-east-1 -op=create-service -service-type=kafkamanager -service-name=kafkamanager-qa -km-heap-size=512 -km-zk-service=zoo-qa -km-user=user -km-passwd=pass
The Kafka Manager heap size is less than 4096. Please increase it for production system
2018-03-05 15:18:48.889010183 +0000 UTC The kafka manager service is created, wait for all containers running
2018-03-05 15:18:48.929875163 +0000 UTC wait the service containers running, RunningCount 0
...
2018-03-05 15:23:50.807640377 +0000 UTC not all service containers are running after 5m0s

firecamp-manageserver log:

I0305 15:18:48.631132 1 route53.go:146] find hosted zone /hostedzone/ZL36HKC3OHITW for domain firecamp-qa-firecamp.com vpc vpc-d44e1eb1 us-east-1 requuid req-649f1bce27b440564fceda5f5d983ae6
I0305 15:18:48.631143 1 route53.go:58] get hostedZoneID /hostedzone/ZL36HKC3OHITW for domain firecamp-qa-firecamp.com vpc vpc-d44e1eb1 us-east-1 requuid req-649f1bce27b440564fceda5f5d983ae6
I0305 15:18:48.631154 1 service.go:135] get hostedZoneID /hostedzone/ZL36HKC3OHITW for domain firecamp-qa-firecamp.com vpc vpc-d44e1eb1 requuid req-649f1bce27b440564fceda5f5d983ae6 &{us-east-1 firecamp-qa kafkamanager-qa stateless}
I0305 15:18:48.638979 1 dynamodb_service.go:44] created service &{firecamp-qa kafkamanager-qa 4615cc0394144d224729850cfe4db686} requuid req-649f1bce27b440564fceda5f5d983ae6
I0305 15:18:48.639002 1 service.go:695] created service &{firecamp-qa kafkamanager-qa 4615cc0394144d224729850cfe4db686} requuid req-649f1bce27b440564fceda5f5d983ae6
I0305 15:18:48.644111 1 dynamodb_serviceattr.go:107] created service attr &{4615cc0394144d224729850cfe4db686 CREATING 1520263128639010452 1 firecamp-qa kafkamanager-qa { { 0 0 false} { 0 0 false}} true firecamp-qa-firecamp.com /hostedzone/ZL36HKC3OHITW false <nil> {0 256 0 512} stateless} requuid req-649f1bce27b440564fceda5f5d983ae6
I0305 15:18:48.644147 1 service.go:798] created service attr in db &{4615cc0394144d224729850cfe4db686 CREATING 1520263128639010452 1 firecamp-qa kafkamanager-qa { { 0 0 false} { 0 0 false}} true firecamp-qa-firecamp.com /hostedzone/ZL36HKC3OHITW false <nil> {0 256 0 512} stateless} requuid req-649f1bce27b440564fceda5f5d983ae6
I0305 15:18:48.644197 1 service.go:166] created service attr, requuid req-649f1bce27b440564fceda5f5d983ae6 &{4615cc0394144d224729850cfe4db686 CREATING 1520263128639010452 1 firecamp-qa kafkamanager-qa { { 0 0 false} { 0 0 false}} true firecamp-qa-firecamp.com /hostedzone/ZL36HKC3OHITW false <nil> {0 256 0 512} stateless}
I0305 15:18:48.644241 1 dynamodb_serviceattr.go:144] update service status from CREATING to INITIALIZING requuid req-649f1bce27b440564fceda5f5d983ae6 &{4615cc0394144d224729850cfe4db686 INITIALIZING 1520263128644215963 1 firecamp-qa kafkamanager-qa { { 0 0 false} { 0 0 false}} true firecamp-qa-firecamp.com /hostedzone/ZL36HKC3OHITW false <nil> {0 256 0 512} stateless}
I0305 15:18:48.649431 1 dynamodb_serviceattr.go:216] updated service attr &{4615cc0394144d224729850cfe4db686 CREATING 1520263128639010452 1 firecamp-qa kafkamanager-qa { { 0 0 false} { 0 0 false}} true firecamp-qa-firecamp.com /hostedzone/ZL36HKC3OHITW false <nil> {0 256 0 512} stateless} to &{4615cc0394144d224729850cfe4db686 INITIALIZING 1520263128644215963 1 firecamp-qa kafkamanager-qa { { 0 0 false} { 0 0 false}} true firecamp-qa-firecamp.com /hostedzone/ZL36HKC3OHITW false <nil> {0 256 0 512} stateless} requuid req-649f1bce27b440564fceda5f5d983ae6
I0305 15:18:48.649471 1 service.go:185] successfully created service, requuid req-649f1bce27b440564fceda5f5d983ae6 &{4615cc0394144d224729850cfe4db686 INITIALIZING 1520263128644215963 1 firecamp-qa kafkamanager-qa { { 0 0 false} { 0 0 false}} true firecamp-qa-firecamp.com /hostedzone/ZL36HKC3OHITW false <nil> {0 256 0 512} stateless}
I0305 15:18:48.701299 1 cloudwatch.go:152] created log group firecamp-qa-kafkamanager-qa-4615cc0394144d224729850cfe4db686 requuid req-649f1bce27b440564fceda5f5d983ae6
I0305 15:18:48.734275 1 ecs.go:294] service is inactive kafkamanager-qa cluster firecamp-qa
I0305 15:18:48.760506 1 ecs.go:341] ListTaskDefinitionFamilies prefix firecamp-qa-kafkamanager-qa token <nil> resp {
Families: ["firecamp-qa-kafkamanager-qa"]
}

ECS console displays this error:

Status reason | CannotStartContainerError:  API error (500): failed to initialize logging driver:  ResourceNotFoundException: The specified log group does not exist. 	status code: 400, request id: ed451ab0-2088-11e8-a5e4-6f3c66053865

Looks like the service is still trying to create a log group in the outdated format:

 "requestParameters": { "logGroupName": "firecamp-firecamp-qa-kafkamanager-qa-b5291cb97e624299744ef6d9b9ce5ad9", "logStreamName": "kafkamanager-qa/firecamp-qa-kafkamanager-qa-container/544c5f99-e1e9-46b1-b50d-fe05a91aaaf7" },

Volume encryption at rest for AWS

Hi. Please implement volume encryption at rest for the AWS environment. I'm not sure whether journal volumes should be encrypted; probably not, unless they contain sensitive data.

can't start zookeeper in 0.9.2

I'm sorry for bothering you, but this 0.9.2 release is a headache for me. Can you please check if you can start zookeeper in ECS with the following command:

# ./firecamp-service-cli -op=create-service -service-type=zookeeper -region=us-east-1 -cluster=firecamp-prod -service-name=zoo-prod -replicas=3 -volume-size=20 -zk-heap-size=512

I'm getting:

The ZooKeeper heap size is less than 4096. Please increase it for production system
The zookeeper service is created, wait for all containers running
wait the service containers running, RunningCount 0
...
wait the service containers running, RunningCount 1
not all service containers are running after 120

In the end, only one zookeeper container is running.
Service events show:

85171af8-094f-48c8-95c1-8ddc1406cfd3
2018-01-25 20:51:55 +0300
service zoo-prod was unable to place a task because no container instance met all of its requirements. The closest matching container-instance 986e672b-838a-4215-94c0-1ae8d8cf783b encountered error "memberOf constraint unsatisfied". For more information, see the Troubleshooting section.

cec4fbb1-b192-48d1-8747-a34c160a8481
2018-01-25 20:51:42 +0300
service zoo-prod has started 1 tasks: task d4e9c873-4629-4688-8b5b-9f8b1fcda874.

The firecamp log ends with:

...
I0125 17:54:39.218207 1 server.go:688] get service status &{1 3} requuid req-722d99f1ef0c470c463dee0fe2e1dfea &{us-east-1 firecamp-prod zoo-prod}
I0125 17:54:44.219742 1 server.go:105] request Method GET URL /?Get-Service-Status ?Get-Service-Status Host firecamp-manageserver.firecamp-prod-firecamp.com:27040 requuid req-9a1353ee054041c96c770a55a24813c3 headers map[Accept-Encoding:[gzip] User-Agent:[Go-http-client/1.1] Content-Length:[73]]
I0125 17:54:44.236612 1 ecs.go:759] service zoo-prod has 1 running containers, desired 3
I0125 17:54:44.236634 1 server.go:688] get service status &{1 3} requuid req-9a1353ee054041c96c770a55a24813c3 &{us-east-1 firecamp-prod zoo-prod}
I0125 17:54:49.238279 1 server.go:105] request Method GET URL /?Get-Service-Status ?Get-Service-Status Host firecamp-manageserver.firecamp-prod-firecamp.com:27040 requuid req-97454a904f2b4c8161b8cf499e72d06a headers map[User-Agent:[Go-http-client/1.1] Content-Length:[73] Accept-Encoding:[gzip]]
I0125 17:54:49.256414 1 ecs.go:759] service zoo-prod has 1 running containers, desired 3
I0125 17:54:49.256441 1 server.go:688] get service status &{1 3} requuid req-97454a904f2b4c8161b8cf499e72d06a &{us-east-1 firecamp-prod zoo-prod}

Any ideas what's going on?

Cassandra: monitoring

I would like to open a discussion on this topic. Some suggestions:

- Add the Jolokia jar to the C* containers. It makes it easy to fetch metrics via HTTP requests.
- Make it configurable (at C* service creation and C* update-service) to auto-create CloudWatch alarms for the most important metrics (like latency and free disk space).
- Create a new service to run the TICK (https://github.com/influxdata/sandbox) stack (in a single ECS task), which will retrieve metrics from C* (Telegraf), feed them into InfluxDB, and provide ready-to-use dashboards for various metrics (Chronograf).

Please, share your thoughts.

http://jolokia.org/agent/jvm.html
https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/
http://cassandra.apache.org/doc/latest/operating/metrics.html

manageserver fails to start if the cluster name is not in lowercase

I1124 15:39:09.383539       1 route53.go:133] zone is not for domain FireCamp-UAT-firecamp.com zone {
  CallerReference: "FireCamp-Route53H-1TC3D958X5KBG",
  Config: {
    PrivateZone: true
  },
  Id: "/hostedzone/ABCD",
  Name: "firecamp-uat-firecamp.com.",
  ResourceRecordSetCount: 2
} requuid
E1124 15:39:09.383573       1 route53.go:52] CreateHostedZone error DomainNotFound domain FireCamp-UAT-firecamp.com vpc vpc-0000000 us-east-1 requuid
E1124 15:39:09.383587       1 server_start.go:49] GetOrCreateHostedZoneIDByName error DomainNotFound domain FireCamp-UAT-firecamp.com vpcID vpc-000000
F1124 15:39:09.383596       1 main.go:171] StartServer error DomainNotFound
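
The log shows the hosted zone stored as `firecamp-uat-firecamp.com.` (lowercase, with a trailing dot) while the lookup uses `FireCamp-UAT-firecamp.com`, so the zone check seems to be case-sensitive. I have not looked at how route53.go actually compares the names; the following is only a sketch of a comparison that would tolerate both differences, with `sameDomain` as a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// sameDomain reports whether a hosted zone name refers to the requested
// domain, ignoring letter case and an optional trailing dot on either side.
func sameDomain(zoneName, domain string) bool {
	trim := func(s string) string { return strings.TrimSuffix(s, ".") }
	return strings.EqualFold(trim(zoneName), trim(domain))
}

func main() {
	// The pair from the log above.
	fmt.Println(sameDomain("firecamp-uat-firecamp.com.", "FireCamp-UAT-firecamp.com"))

	// A genuinely different domain must still be rejected.
	fmt.Println(sameDomain("firecamp-uat-firecamp.com.", "firecamp-prod-firecamp.com"))
}
```

An alternative would be to lowercase the cluster name once, when the domain name is derived from it.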
