kubernetes connection refused during deployment

google-cloud-platform kubernetes google-kubernetes-engine

5,141

I ran into the same problem and dug a bit deeper into the GKE network setup for this kind of load balancing.

My suspicion is that the iptables rules on the node that runs the container are updated too early. I increased the timeouts a bit in your example to make it easier to see at which stage the requests start timing out.

My changes on your deployment:

spec:
...
  replicas: 1         # easier to track the state of the system
  minReadySeconds: 30 # give the load-balancer time to pick up the new node
...
  template:
    spec:
      containers:
      - name: myapp-container
        command: ["sh", "-c", "./hello-app"] # ignore SIGTERM and keep serving requests for 30s

Everything works well until the old pod switches from state Running to Terminating. I tested with a kubectl port-forward on the terminating pod and my requests were served without timeouts.

The following things happen during the change from Running to Terminating:

The pod IP is removed from the service
The health check on the node returns 503 with "localEndpoints": 0
The iptables rules are changed on that node and traffic for this service is dropped (--comment "default/myapp-lb: has no local endpoints" -j KUBE-MARK-DROP)

By default, the load balancer's health check runs every 2 seconds and needs 5 consecutive failures before it removes the node. This means packets are dropped for at least 10 seconds. After I changed the interval to 1 second and the threshold to 1 failure, the number of dropped packets decreased.
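As a rough sketch of the arithmetic above: the window during which the load balancer still sends traffic to a node without local endpoints is approximately the check interval multiplied by the failure threshold. The function name `drop_window` is made up for illustration and is not part of any Kubernetes or GCP tooling.

```shell
#!/bin/sh
# Approximate time (in seconds) a node keeps receiving traffic after its
# health check starts failing: check interval times the failure threshold.
# drop_window is a hypothetical helper, used only for this illustration.
drop_window() {
    interval_s=$1   # seconds between two health checks
    failures=$2     # consecutive failures before the node is removed
    echo $((interval_s * failures))
}

echo "default (2s x 5): $(drop_window 2 5)s"
echo "tuned   (1s x 1): $(drop_window 1 1)s"
```

With the default settings this gives about 10 seconds, and with the tuned settings about 1 second, which matches the drop in lost packets I observed.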

If you are not interested in the source IP of the client, you could remove the line:

externalTrafficPolicy: Local

in your service definition (the policy then falls back to the default, Cluster), and deployments run without connection timeouts.

Tested on a GKE cluster with 4 nodes and version v1.9.7-gke.1.

Author by

thoas

Updated on September 18, 2022

Comments

thoas over 1 year

I'm trying to achieve a zero-downtime deployment using Kubernetes, and during my tests the service doesn't load balance well.

My kubernetes manifest is:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myapp-deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app: myapp
        version: "0.2"
    spec:
      containers:
      - name: myapp-container
        image: gcr.io/google-samples/hello-app:1.0
        imagePullPolicy: Always
        ports:
          - containerPort: 8080
            protocol: TCP
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1

---

apiVersion: v1
kind: Service
metadata:
  name: myapp-lb
  labels:
    app: myapp
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: myapp

If I loop over the service with the external IP, let's say:

$ kubectl get services
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)        AGE
kubernetes   ClusterIP      10.35.240.1    <none>           443/TCP        1h
myapp-lb     LoadBalancer   10.35.252.91   35.205.100.174   80:30549/TCP   22m

using the bash script:

while true
    do
        curl 35.205.100.174
        sleep 0.2s
    done

I receive some "connection refused" errors during the deployment:

curl: (7) Failed to connect to 35.205.100.174 port 80: Connection refused

The application is the default hello-app provided by Google Cloud Platform, listening on port 8080.

Cluster information:

Kubernetes version: 1.8.8
Google cloud platform
Machine type: g1-small

DevopsTux about 6 years

how frequently are you getting those connection refused errors? I'm trying the same deployment as you right now and removed the sleep to stress test the service; so far I'm at around 2000 requests and 0 fails.
DevopsTux about 6 years

0 errors on 20.000 requests now.
thoas about 6 years

it occurs during a deployment only, try changing the version and restart the script. If I siege the service internal ip or external ip I get some connection refused
DevopsTux almost 6 years

where are you launching the siege from exactly?
thoas almost 6 years

the siege is launched locally and was also tested in a busybox directly in the cluster using the Cluster IP

thoas almost 6 years

thank you for your answer, siege is not the issue here since we have tested on multiple servers and even with a dead simple curl loop.
thoas almost 6 years

same issue with the minReadySeconds, I get a curl: (56) Recv failure: Connection reset by peer during a deployment
DevopsTux almost 6 years

What siege does is, in fact, a bit like a curl loop. You will end up running out of sockets with either; Kubernetes is not the problem here. Did you try my answer?
thoas almost 6 years

yes I tried your answer, it's not related to the HTTP client (we are also testing it in pure Python with only one connection) and the ingress is returning some 502 status codes.
DevopsTux over 5 years

Is it possible you are getting this error while the public IP for the load balancer is provisioned?