Kubernetes connection refused during deployment

I ran into the same problem and tried to dig a bit deeper into the GKE network setup for this kind of load balancing.

My suspicion is that the iptables rules on the node that runs the container are updated too early. I increased the timeouts a bit in your example to better pinpoint the stage at which the requests hit timeouts.

My changes to your deployment:

spec:
...
  replicas: 1         # easier to track the state of the system
  minReadySeconds: 30 # give the load-balancer time to pick up the new node
...
  template:
    spec:
      containers:
        - command: ["sh", "-c", "./hello-app"] # ignore SIGTERM and keep serving requests for the 30s grace period

Everything works well until the old pod switches from Running to Terminating. I tested with kubectl port-forward against the terminating pod, and my requests were served without timeouts.
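
A sketch of that test (the pod name is a placeholder; take the Terminating pod from kubectl get pods):

# Forward a local port straight to the terminating pod, bypassing the service
kubectl port-forward myapp-deployment-xxxxx 8080:8080 &

# Requests against the pod itself keep succeeding while it is Terminating
while true; do curl -s http://localhost:8080/; sleep 0.2; done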

The following things happen during the change from Running to Terminating:

  • The pod IP is removed from the service's endpoints
  • The health check on the node returns 503 with "localEndpoints": 0
  • The iptables rules on that node are changed so that traffic for this service is dropped (--comment "default/myapp-lb: has no local endpoints" -j KUBE-MARK-DROP); both checks are sketched below
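
A rough way to observe this yourself (NODE_IP and HEALTH_CHECK_NODE_PORT are placeholders; with externalTrafficPolicy: Local, kube-proxy serves a health endpoint on the service's healthCheckNodePort):

# Port kube-proxy answers health checks on for this service
kubectl get svc myapp-lb -o jsonpath='{.spec.healthCheckNodePort}'

# During termination this returns HTTP 503 with "localEndpoints": 0
curl -i http://NODE_IP:HEALTH_CHECK_NODE_PORT/healthz

# The drop rule shows up in the node's iptables dump
sudo iptables-save | grep myapp-lb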

The load balancer's default health check runs every 2 seconds and needs 5 consecutive failures to remove the node, so packets are dropped for at least 10 seconds. After I changed the interval to 1 second and the unhealthy threshold to 1 failure, the number of dropped packets decreased.
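
I changed those values on the GCP health check directly. A sketch of how, assuming the legacy HTTP health check that GKE creates for the network load balancer (the actual k8s-... name has to be looked up first, and GKE may reconcile manual edits back at some point):

# Find the health check GKE created for the service
gcloud compute http-health-checks list

# HEALTH_CHECK_NAME is a placeholder for the k8s-... entry from the list above
gcloud compute http-health-checks update HEALTH_CHECK_NAME \
    --check-interval 1s \
    --unhealthy-threshold 1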

If you are not interested in the source IP of the client, you could remove the line:

externalTrafficPolicy: Local

from your service definition, and the deployments run without connection timeouts.
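
For completeness, here is the service from your question without that line. externalTrafficPolicy then defaults to Cluster, so a node without a local endpoint forwards traffic to a ready pod elsewhere instead of dropping it (at the cost of the client source IP and a possible extra hop):

apiVersion: v1
kind: Service
metadata:
  name: myapp-lb
  labels:
    app: myapp
spec:
  type: LoadBalancer
  # externalTrafficPolicy defaults to Cluster: nodes without a local
  # endpoint forward traffic instead of dropping it during the rollout
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: myapp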

Tested on a GKE cluster with 4 nodes and version v1.9.7-gke.1.

Comments

  • thoas
    thoas over 1 year

    I'm trying to achieve a zero-downtime deployment using Kubernetes, and during my tests the service doesn't load-balance well.

    My Kubernetes manifest is:

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: myapp-deployment
    spec:
      replicas: 3
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0
          maxSurge: 1
      template:
        metadata:
          labels:
            app: myapp
            version: "0.2"
        spec:
          containers:
          - name: myapp-container
            image: gcr.io/google-samples/hello-app:1.0
            imagePullPolicy: Always
            ports:
              - containerPort: 8080
                protocol: TCP
            readinessProbe:
              httpGet:
                path: /
                port: 8080
              initialDelaySeconds: 5
              periodSeconds: 5
              successThreshold: 1
    
    ---
    
    apiVersion: v1
    kind: Service
    metadata:
      name: myapp-lb
      labels:
        app: myapp
    spec:
      type: LoadBalancer
      externalTrafficPolicy: Local
      ports:
        - port: 80
          targetPort: 8080
      selector:
        app: myapp
    

    If I loop over the service using its external IP, say:

    $ kubectl get services
    NAME         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)        AGE
    kubernetes   ClusterIP      10.35.240.1    <none>           443/TCP        1h
    myapp-lb     LoadBalancer   10.35.252.91   35.205.100.174   80:30549/TCP   22m
    

    using the bash script:

    while true; do
        curl 35.205.100.174
        sleep 0.2
    done
    

    I receive some connection refused errors during the deployment:

    curl: (7) Failed to connect to 35.205.100.174 port 80: Connection refused

    The application is the default hello-app provided by Google Cloud Platform, listening on port 8080.

    Cluster information:

    • Kubernetes version: 1.8.8
    • Google Cloud Platform
    • Machine type: g1-small
    • DevopsTux
      DevopsTux about 6 years
      How frequently are you getting those connection refused errors? I'm trying the same deployment as you right now, with the sleep removed to stress-test the service, and I'm at around 2,000 requests and 0 failures.
    • DevopsTux
      DevopsTux about 6 years
      0 errors on 20,000 requests now.
    • thoas
      thoas about 6 years
      It occurs only during a deployment; try changing the version and restarting the script. If I siege the service's internal or external IP, I get some connection refused errors.
    • DevopsTux
      DevopsTux almost 6 years
      Where are you launching the siege from, exactly?
    • thoas
      thoas almost 6 years
      The siege is launched locally, and I also tested it in a busybox pod directly in the cluster using the cluster IP.
  • thoas
    thoas almost 6 years
    Thank you for your answer; siege is not the issue here, since we have tested from multiple servers and even with a dead-simple curl loop.
  • thoas
    thoas almost 6 years
    Same issue with minReadySeconds; I get curl: (56) Recv failure: Connection reset by peer during a deployment.
  • DevopsTux
    DevopsTux almost 6 years
    What siege does is, in fact, a bit like a curl loop. You will end up running out of sockets with either one; Kubernetes is not the problem here. Did you try my answer?
  • thoas
    thoas almost 6 years
    Yes, I tried your answer. It's not related to the HTTP client (we are also testing it in pure Python with only one connection), and the ingress is returning some 502 status codes.
  • DevopsTux
    DevopsTux over 5 years
    Is it possible you are getting this error while the public IP for the load balancer is being provisioned?