Comments (18)
This looks like the steps of a rolling restart of ZooKeeper. It may be an issue in ZooKeeper itself. The full logs would help to analyse the issue further.
from firecamp.
The cluster has been running for several months without issues, and I probably no longer have the logs from the time it was started. Please check the attached logs (zoo.log.gz) and let me know if you need anything else.
from firecamp.
I'm not sure of the root cause. The node does not appear to be able to connect to ZooKeeper on 172.31.4.202.
2018-10-11 10:10:29,214 [myid:2] - WARN [WorkerSender[myid=2]:QuorumCnxManager@584] - Cannot open channel to 3 at election address zoo-prod-2.firecamp-prod-firecamp.com/172.31.4.202:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:558)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:534)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:454)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:435)
at java.lang.Thread.run(Thread.java:748)
2018-10-11 10:10:29,216 [myid:2] - INFO [WorkerSender[myid=2]:QuorumPeer$QuorumServer@184] - Resolved hostname: zoo-prod-2.firecamp-prod-firecamp.com to address: zoo-prod-2.firecamp-prod-firecamp.com/172.31.4.202
from firecamp.
Yeah, I'm not sure either. The main point is that it was fixed by restarting ZooKeeper. Perhaps we need to wait for all ZooKeeper nodes to start before starting Kafka?
from firecamp.
Are you able to get the logs from node 172.31.4.202?
Kafka does rely on ZooKeeper. If the ZooKeeper cluster is not working, Kafka will not work. We could consider introducing a dependency between services. However, detecting whether a service is healthy might not be easy, because it requires looking into the service's internal status. For Kafka this is not necessary: Kafka itself will wait until ZooKeeper is running.
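To illustrate what "looking into the service's internal status" could mean for ZooKeeper, here is a minimal sketch that sends the `srvr` four-letter command to a node and checks the `Mode:` line of the reply. It is not something Firecamp does today. The client port 2181, the helper names, and the set of modes treated as healthy are assumptions for illustration; newer ZooKeeper releases also restrict four-letter commands through a whitelist setting, so the command may need to be enabled.

```python
import socket
from typing import Optional

# Modes in which a ZooKeeper node is serving requests (assumed set for this sketch).
SERVING_MODES = {"leader", "follower", "observer", "standalone"}


def parse_mode(srvr_reply: str) -> Optional[str]:
    """Return the value of the 'Mode:' line in a `srvr` reply, or None if absent."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def is_serving(srvr_reply: str) -> bool:
    """True if the reply reports a mode in which the node serves requests."""
    return parse_mode(srvr_reply) in SERVING_MODES


def query_srvr(host: str, port: int = 2181, timeout: float = 2.0) -> str:
    """Send `srvr` to a ZooKeeper client port; return the reply, or '' on any socket error."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closes the connection after replying
                    break
                chunks.append(data)
        return b"".join(chunks).decode("utf-8", errors="replace")
    except OSError:
        return ""
```

A caller could run `is_serving(query_srvr(host))` against every ZooKeeper member. A node that is up but has no quorum replies without a `Mode:` line, so it is reported as not serving.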
from firecamp.
What kind of logs do you need?
from firecamp.
The ZooKeeper logs, to see whether they show why the connection fails.
from firecamp.
@JuniusLuo, please check #80 (comment) for zookeeper logs
from firecamp.
There is only one log file. We need the log file for the ZooKeeper instance on 172.31.4.202.
from firecamp.
It looks like 4.202 is zoo2. zoo2 had not started yet at 10:10:29. The first log entry in zoo2.log.gz is at 10:11:14.
2018-10-11 10:11:14,536 [myid:] - INFO [main:QuorumPeerConfig@136] - Reading configuration from: /etc/zk/zoo.cfg
2018-10-11 10:11:14,560 [myid:] - INFO [main:QuorumPeer$QuorumServer@184] - Resolved hostname: zoo-prod-2.firecamp-prod-firecamp.com to address: zoo-prod-2.firecamp-prod-firecamp.com/172.31.4.202
Probably, while the system was unstable, the ZooKeeper instance kept restarting itself. Restarting the ZooKeeper service helps to bring all instances up at around the same time. This looks like an issue in ZooKeeper itself. The ZooKeeper instance should probably just wait and retry.
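As a rough sketch of the "wait and retry" idea, the snippet below polls a peer's election port with exponential backoff until it accepts a TCP connection. The hostname and port 3888 are taken from the log above. The function names, attempt count, and delays are hypothetical, and this is not how ZooKeeper or Firecamp implements retries.

```python
import socket
import time
from typing import Callable


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused, timeouts, DNS failures
        return False


def wait_until(check: Callable[[], bool], attempts: int = 10,
               first_delay: float = 1.0, max_delay: float = 30.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() up to `attempts` times, doubling the pause between tries."""
    delay = first_delay
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:  # no need to sleep after the final attempt
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False
```

For example, `wait_until(lambda: port_open("zoo-prod-2.firecamp-prod-firecamp.com", 3888))` would keep trying for a few minutes instead of failing on the first refused connection.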
from firecamp.
Could you please make firecamp-manager aware of this issue, so Kafka will be restarted only after all ZK instances are up and running?
from firecamp.
The manager service aims to be a common service. Monitoring the health status of one specific service is too service-specific, so it does not look like the best fit for the manager service.
from firecamp.
@JuniusLuo, in general, I agree with you. But this particular service, Kafka, does not work at all without ZooKeeper, so they're tied together by design. I think handling this case in the manager would be a good feature.
Otherwise, we probably need to put that in the docs, so other people won't be confused.
from firecamp.
This is not an easy task. It amounts to requiring full monitoring of ZooKeeper. For example, if ZooKeeper fails because of some bug or issue, Kafka will not work either.
from firecamp.
We just need to start the Kafka containers after ZooKeeper is up. Do you really think that's hard to implement?
from firecamp.
Currently we don't have control over which service starts first. Firecamp is simple. The manage service is only responsible for initializing the service and updating the service configs, such as creating volumes. It is then ECS's responsibility to schedule the containers, and the manage service is not involved in the scheduling. The Firecamp plugin talks to DynamoDB to grab the volume and update the network.
from firecamp.
Understood. What do you think of adding this issue somewhere in the wiki, so that people are aware of such things?
from firecamp.