Kubernetes connection refused during deployment
I had the same problem and dug a bit deeper into the GKE network setup for this kind of load balancing.
My suspicion is that the iptables rules on the node that runs the container are updated too early. I increased the timeouts a bit in your example to better identify the stage at which the requests time out.
My changes to your deployment:
spec:
  ...
  replicas: 1 # easier to track the state of the system
  minReadySeconds: 30 # give the load-balancer time to pick up the new node
  ...
  template:
    spec:
      containers:
      - command: ["sh", "-c", "./hello-app"] # ignore SIGTERM and keep serving requests for 30s
Everything works well until the old pod switches from state Running to Terminating. I tested with a kubectl port-forward on the terminating pod and my requests were served without timeouts.
The following things happen during the change from Running to Terminating:
- Pod-IP is removed from the service
- Health check on the node returns 503 with "localEndpoints": 0
- iptables rules are changed on that node and traffic for this service is dropped (--comment "default/myapp-lb: has no local endpoints" -j KUBE-MARK-DROP)
With its default settings, the load-balancer health check runs every 2 seconds and needs 5 consecutive failures before it removes the node. This means packets are dropped for at least 10 seconds. After I changed the interval to 1 second and the threshold to 1 failure, the number of dropped packets decreased.
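The arithmetic behind that 10-second window can be sketched in a few lines of shell. This is only an illustration: `drop_window` is a hypothetical helper name, and it gives a lower bound that ignores propagation delays inside the load balancer.

```shell
# Lower bound (in seconds) on how long a node keeps receiving traffic
# after its health check starts failing: check interval x unhealthy threshold.
# drop_window is a hypothetical helper, not part of any tool.
drop_window() {
  interval_s=$1
  unhealthy_threshold=$2
  echo $(( interval_s * unhealthy_threshold ))
}

drop_window 2 5   # default settings described above -> prints 10
drop_window 1 1   # tightened settings -> prints 1

# Assumption: the service is backed by a legacy HTTP health check, in which
# case the settings could be changed with something like the following
# (HEALTH_CHECK_NAME is a placeholder; look up the real name first):
#   gcloud compute http-health-checks update HEALTH_CHECK_NAME \
#     --check-interval=1s --unhealthy-threshold=1
```

Even with the tightest settings the window does not reach zero, which matches the observation that dropped packets decreased rather than disappeared.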
If you are not interested in the source IP of the client, you could remove the line:
externalTrafficPolicy: Local
in your service definition; deployments then complete without connection timeouts.
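Removing that line makes the Service fall back to the default policy, Cluster, under which any node can forward traffic to a pod on another node (at the cost of hiding the client source IP). As a rough sketch, the same change could be applied to a live Service with a patch instead of editing the manifest. `policy_patch` is a hypothetical helper; `myapp-lb` is the Service name from the question.

```shell
# Build the JSON patch that sets a Service's externalTrafficPolicy.
# policy_patch is a hypothetical helper used only for illustration.
policy_patch() {
  case "$1" in
    Cluster|Local)
      printf '{"spec":{"externalTrafficPolicy":"%s"}}\n' "$1"
      ;;
    *)
      echo "policy must be Cluster or Local" >&2
      return 1
      ;;
  esac
}

# Usage against the Service from the question (needs cluster access):
#   kubectl patch service myapp-lb -p "$(policy_patch Cluster)"
```

Patching back to Local restores source-IP preservation if it turns out to be needed.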
Tested on a GKE cluster with 4 nodes and version v1.9.7-gke.1.
thoas
Updated on September 18, 2022
Comments
- thoas over 1 year
I'm trying to achieve a zero-downtime deployment using Kubernetes, and during my tests the service doesn't load balance well.
My kubernetes manifest is:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myapp-deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app: myapp
        version: "0.2"
    spec:
      containers:
      - name: myapp-container
        image: gcr.io/google-samples/hello-app:1.0
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-lb
  labels:
    app: myapp
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: myapp
If I loop over the service with the external IP, let's say:
$ kubectl get services
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)        AGE
kubernetes   ClusterIP      10.35.240.1    <none>           443/TCP        1h
myapp-lb     LoadBalancer   10.35.252.91   35.205.100.174   80:30549/TCP   22m
using the bash script:
while true
do
  curl 35.205.100.174
  sleep 0.2s
done
I receive some connection refused errors during the deployment:
curl: (7) Failed to connect to 35.205.100.174 port 80: Connection refused
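To compare settings it helps to count the failures instead of watching the output scroll by. A minimal sketch, assuming a POSIX shell; `count_failures` is a hypothetical helper that runs any command a fixed number of times and prints how many runs exited non-zero. Unlike the loop above, it does not sleep between runs.

```shell
# Run a command N times and print how many of the runs failed.
# count_failures is a hypothetical helper, not part of curl or kubectl.
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard the command's output; only its exit status matters here.
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures"
}

# Usage during a rollout (IP taken from the output above):
#   count_failures 100 curl --silent --max-time 2 35.205.100.174
```

Because curl exits non-zero on a refused or reset connection, the printed number is the count of failed requests out of the total.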
The application is the default hello-app provided by Google Cloud Platform, listening on port 8080.
Cluster information:
- Kubernetes version: 1.8.8
- Google Cloud Platform
- Machine type: g1-small
- DevopsTux about 6 years
How frequently are you getting those connection refused errors? I'm trying the same deployment as you right now, with the sleep removed to stress test the service, and I'm at around 2000 requests and 0 failures.
- DevopsTux about 6 years
0 errors on 20,000 requests now.
- thoas about 6 years
It occurs during a deployment only; try changing the version and restarting the script. If I siege the service's internal IP or external IP I get some connection refused errors.
- DevopsTux almost 6 years
Where are you launching the siege from exactly?
- thoas almost 6 years
The siege is launched locally and was also tested in a busybox directly in the cluster using the Cluster IP.
- thoas almost 6 years
Thank you for your answer. siege is not the issue here, since we have tested on multiple servers and even with a dead simple curl loop.
- thoas almost 6 years
Same issue with minReadySeconds: I get a curl: (56) Recv failure: Connection reset by peer during a deployment.
- DevopsTux almost 6 years
What siege does is, in fact, a bit like a curl loop. You will end up running out of sockets with either; Kubernetes is not the problem here. Did you try my answer?
- thoas almost 6 years
Yes, I tried your answer. It's not related to the HTTP client (we are also testing it in pure Python with only one connection), and the ingress is returning some 502 status codes.
- DevopsTux over 5 years
Is it possible you are getting this error while the public IP for the load balancer is being provisioned?