ZooKeeper server not running

15,400

Your election port is being bound to sgX.imatiasl.lan/127.0.0.1:3888 on every node, so when the other servers try to connect to sgY.imatiasl.lan/10.7.0.93:3888 the connection is refused.
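
The address above comes from the "My election bind port" line in the server log quoted in the question. As a rough sketch (the function name election_bind and the regular expression are my own, and it assumes the IPv4 "host/ip:port" layout seen in that log), you could pull the address out of a log line and flag a loopback bind like this:

```python
import ipaddress
import re

# Matches the "host/ip:port" part of the election bind line shown in the log.
# Assumption: IPv4 addresses only, formatted as in the quoted log output.
BIND_RE = re.compile(
    r"My election bind port: (?P<host>[^/\s]*)/(?P<ip>[^:\s]+):(?P<port>\d+)"
)


def election_bind(line):
    """Return (host, ip, port, is_loopback) for a bind line, or None otherwise."""
    match = BIND_RE.search(line)
    if match is None:
        return None
    ip = match["ip"]
    # is_loopback is True for anything in 127.0.0.0/8.
    return match["host"], ip, int(match["port"]), ipaddress.ip_address(ip).is_loopback
```

If the last element of the returned tuple is True, the election port is only reachable from the node itself, which matches the "Connection refused" warnings the other nodes log.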

The election ports should bind to 0.0.0.0:3888 or to the real IP of each node, but for some reason the hostnames are being resolved to 127.0.0.1. You can confirm this by checking the IP:port on each node with netstat -patun.
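
Besides netstat, you can ask the local resolver directly what a node's hostname maps to. This is only a minimal sketch: is_loopback, describe_binding and check_hostname are names I made up, and socket.gethostbyname follows the system's resolver order, which on typical Linux setups reads /etc/hosts before DNS.

```python
import ipaddress
import socket


def is_loopback(ip):
    """True if ip lies in a loopback range (127.0.0.0/8 or ::1)."""
    return ipaddress.ip_address(ip).is_loopback


def describe_binding(hostname, ip):
    """Summarise whether a port bound to this address is reachable by peers."""
    if is_loopback(ip):
        return f"{hostname} -> {ip}: loopback, other nodes cannot reach a port bound here"
    return f"{hostname} -> {ip}: not loopback"


def check_hostname(hostname):
    """Resolve hostname with the system resolver and describe the result."""
    return describe_binding(hostname, socket.gethostbyname(hostname))
```

Running check_hostname("sg1.imatiasl.lan") on sg1 itself (and the equivalent on the other nodes) should report a non-loopback address once name resolution is fixed.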

Most likely you have a problem in /etc/hosts. Take a look at: https://unix.stackexchange.com/questions/240506/zookeeper-dns-name-problems-with-leader-elections-when-migrating-from-windows-to
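
To spot the kind of /etc/hosts entry that causes this, something like the following could scan the file for real hostnames mapped to a loopback address. It is a sketch under assumptions: loopback_aliases and LOCAL_NAMES are hypothetical names, the list of "harmless" local names is my guess at common defaults, and the sample entries in the usage note are illustrative rather than taken from the asker's machines.

```python
import ipaddress

# Names that are expected to point at loopback (assumed list of common defaults).
LOCAL_NAMES = {"localhost", "localhost.localdomain", "ip6-localhost", "ip6-loopback"}


def loopback_aliases(hosts_text):
    """Return (ip, name) pairs where a non-localhost name maps to a loopback IP."""
    findings = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        try:
            address = ipaddress.ip_address(ip)
        except ValueError:
            continue  # skip entries that are not plain IP addresses
        if address.is_loopback:
            findings.extend((ip, name) for name in names if name not in LOCAL_NAMES)
    return findings
```

For example, loopback_aliases("127.0.0.1 localhost sg1.imatiasl.lan sg1\n10.7.0.93 sg2.imatiasl.lan sg2") returns the two sg1 names paired with 127.0.0.1. On a real node you would pass it the contents of /etc/hosts, e.g. open("/etc/hosts").read().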

Author by

NotGaeL

None of the opinions expressed on the content I post here necessarily reflect the opinions of my employers (or my current ones). The dumb ones are probably sarcastic. https://abstrusegoose.com/249

Updated on June 05, 2022

Comments

  • NotGaeL
    NotGaeL almost 2 years

    I'm trying to start an HBase master from Ambari.

    It fails to start because it can't connect to the ZooKeeper server.

    Ambari marks all the ZooKeeper servers (3 nodes) as running.

    The application server (Tomcat?) that runs the ZooKeeper server application seems to be running fine; at least there is a service listening on the specified port.

    But the application is not able to connect to the other nodes, and it doesn't seem to start.

    All the connections are closed with the error message ZooKeeperServer not running in the ZooKeeper server log, and zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket on the client.

    This is the ZooKeeper server log output for those nodes (same log for all of them, only the node names change):

    2016-03-31 16:15:34,550 - INFO  [main:QuorumPeerConfig@103] - Reading configuration from: /usr/hdp/current/zookeeper-server/conf/zoo.cfg
    2016-03-31 16:15:34,553 - INFO  [main:QuorumPeerConfig@338] - Defaulting to majority quorums
    2016-03-31 16:15:34,557 - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 30
    2016-03-31 16:15:34,557 - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 24
    2016-03-31 16:15:34,558 - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
    2016-03-31 16:15:34,565 - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
    2016-03-31 16:15:34,566 - INFO  [main:QuorumPeerMain@127] - Starting quorum peer
    2016-03-31 16:15:34,573 - INFO  [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
    2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@992] - tickTime set to 2000
    2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@1012] - minSessionTimeout set to -1
    2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@1023] - maxSessionTimeout set to -1
    2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@1038] - initLimit set to 10
    2016-03-31 16:15:34,598 - INFO  [Thread-2:QuorumCnxManager$Listener@506] - My election bind port: sg1.imatiasl.lan/127.0.0.1:3888
    2016-03-31 16:15:34,607 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@747] - LOOKING
    2016-03-31 16:15:34,608 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@815] - New election. My id =  1, proposed zxid=0x0
    2016-03-31 16:15:34,609 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (n.peerEpoch) LOOKING (my state)
    2016-03-31 16:15:34,612 - WARN  [WorkerSender[myid=1]:QuorumCnxManager@383] - Cannot open channel to 2 at election address sg2.imatiasl.lan/10.7.0.93:3888
    java.net.ConnectException: Connection refused
            at java.net.PlainSocketImpl.socketConnect(Native Method)
            at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
            at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
            at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
            at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
            at java.net.Socket.connect(Socket.java:589)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
            at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
            at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
            at java.lang.Thread.run(Thread.java:745)
    2016-03-31 16:15:34,614 - WARN  [WorkerSender[myid=1]:QuorumCnxManager@383] - Cannot open channel to 3 at election address sg3.imatiasl.lan/10.7.0.94:3888
    java.net.ConnectException: Conexión rehusada
            at java.net.PlainSocketImpl.socketConnect(Native Method)
            at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
            at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
            at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
            at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
            at java.net.Socket.connect(Socket.java:589)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
            at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
            at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
            at java.lang.Thread.run(Thread.java:745)
    2016-03-31 16:15:34,812 - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@383] - Cannot open channel to 2 at election address sg2.imatiasl.lan/10.7.0.93:3888
    java.net.ConnectException: Connection refused
            at java.net.PlainSocketImpl.socketConnect(Native Method)
            at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
            at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
            at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
            at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
            at java.net.Socket.connect(Socket.java:589)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:404)
            at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
            at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:795)
    2016-03-31 16:15:34,813 - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@383] - Cannot open channel to 3 at election address sg3.imatiasl.lan/10.7.0.94:3888
    java.net.ConnectException: Connection refused
            at java.net.PlainSocketImpl.socketConnect(Native Method)
            at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
            at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
            at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
            at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
            at java.net.Socket.connect(Socket.java:589)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:404)
            at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
            at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:795)
    2016-03-31 16:15:34,813 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400
    

    When the client tries to connect:

    2016-03-31 16:15:35,086 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.7.0.93:55914
    2016-03-31 16:15:35,130 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
    2016-03-31 16:15:35,130 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.7.0.93:55914 (no session established for client)
    

    And so on...

    Any ideas on how to fix this?

  • NotGaeL
    NotGaeL about 8 years
    All of my nodes are refusing connections. This one cannot connect to the other two, and those two cannot connect to this one or to each other. It happens when I try to start ZooKeeper. Ambari says it's starting correctly but, as you can see in the logs, it's not. Do you know why?
  • NotGaeL
    NotGaeL about 8 years
    I ran jps -l as the user zookeeper and got 612 org.apache.zookeeper.server.quorum.QuorumPeerMain, so I guess it is. What can I do now?
  • NotGaeL
    NotGaeL about 8 years
    Thank you! I was binding sgX to 127.0.0.1 in /etc/hosts of each sgX (which incidentally I did to solve some problem during the cluster setup I don't even remember), and that was the problem.
  • NotGaeL
    NotGaeL about 8 years
    Now I can start ZooKeeper and the HBase RegionServer without problems, but the HBase master is still resisting. I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 58: ordinal not in range(128) on checked_call['curl -sS -L -w '%{http_code}' -X GET 'http://sg1.imatiasl.lan:50070/webhdfs/v1/apps/hbase/data?op=GETFILESTATUS&user.name=hdfs''] {'logoutput': None, 'user': 'hdfs', 'stderr': -1, 'quiet': False}. Do you know how I can solve this?
  • NotGaeL
    NotGaeL about 8 years
    (posted another question here: stackoverflow.com/questions/36409105/… )
  • 15412s
    15412s about 8 years
    Can you reboot, or kill the processes? I think you wanted to run just one ZooKeeper service; do you run any script to init ZK that might have a bad loop inside?
  • Alfonso Nishikawa
    Alfonso Nishikawa about 8 years
    I don't use Ambari and I don't know python, but that seems to be related to 'ñ' and tildes. Try removing them wherever they are, or use unicode u'..' strings. I can't deduce much more :( salvatorelab.es/2013/12/…
  • quazardous
    quazardous almost 6 years
    As NotGaeL said, be aware that if you have bound a hostname to a loopback IP such as 127.0.1.1 in /etc/hosts, ZooKeeper will fail to establish server-to-server connections, even with multiple servers.