Seeing 502 on Google App Engine

Unfortunately, there are many possible reasons for a spike in 502 errors, such as:

  • A backend instance took longer than the backend service timeout to respond, which means either the application is overloaded or the backend service timeout is set too low.
  • The frontend was not able to establish a connection to a backend instance.
  • The frontend was not able to identify a viable backend instance to connect to (health checks failing for all backends).

To get more information, you will need to look at the Stackdriver Logging entries for the 502 errors from your Cloud Console, for example with a filter like the one sketched below.
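
In the Logs Viewer, a filter along the following lines is one rough sketch of how to isolate those requests; the service name "default" is only an assumption, and httpRequest.status is the field that carries the response code:

    resource.type="gae_app"
    resource.labels.module_id="default"
    httpRequest.status=502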

The next time the spikes occur, check whether the 502 errors are being caused by your health checks giving false positives. Another ServerFault post that ran into the same issue may give you a little more insight. If this is the case, you may want to look into increasing your instance's disk space, as sketched below.
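
If disk space does turn out to be the problem, a flexible environment instance's disk is sized in the resources section of app.yaml. This is only a minimal sketch, assuming you are on the flexible environment; the values are illustrative, not recommendations:

    resources:
      cpu: 1           # vCPUs per instance
      memory_gb: 1     # memory per instance
      disk_size_gb: 20 # illustrative value only; raise if instances run out of disk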

To avoid further spikes, I suggest you add readiness checks to your app.yaml file alongside the liveness checks, so that your instance does not receive any traffic before it is actually ready to handle it. You may have already seen it, but here is the documentation for adding readiness checks.
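
As a rough sketch of what those two sections can look like in app.yaml with the flexible environment's split health checks (the paths and thresholds below are roughly the documented defaults, not tuned values):

    liveness_check:               # decides whether an instance gets restarted
      path: "/liveness_check"
      check_interval_sec: 30
      timeout_sec: 4
      failure_threshold: 4
      success_threshold: 2

    readiness_check:              # gates whether an instance receives traffic
      path: "/readiness_check"
      check_interval_sec: 5
      timeout_sec: 4
      failure_threshold: 2
      success_threshold: 2
      app_start_timeout_sec: 300  # how long a deployment may take before failing

Your app then has to answer those two paths with a 200 response: readiness controls when traffic starts flowing to an instance, while liveness controls when an unhealthy instance is replaced.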

One final thing to check is whether the error percentage during the spike, measured against all of your traffic, still falls within the App Engine SLA.

Comments

  • Admin (over 1 year ago)

    I'm seeing quite a few 502 "Bad Gateway" errors on Google App Engine. It's difficult to see on the chart below (the colors are very similar and I can't figure out how to change them), but this is my traffic over the past 6 hours:

    Traffic over the past 6 hours

    The dark pink line represents 5xx errors. They started around 9:30a this morning and calmed down around 12:30p PST. But for those 3 hours nginx was returning 502 Bad Gateway pretty consistently. And then it just stopped.

    During that time, the only changes I committed to try and alter the behavior were to increase each instance from 0.5 GB to 1 GB of memory and to increase the cache TTL on some 404 responses. I also added a liveness check so nginx would know when app servers were down.

    I checked nginx's error log and saw a bunch of these:

    failed (111: Connection refused) while connecting to upstream
    

    I triple-checked that all my app servers are running on port 8080, so I ruled that out. I'm thinking that maybe the liveness check helped App Engine know when to reboot the servers that needed it, but I don't see anything in the stdout logs from the app servers that indicates any of them were bad.

    Could this just be an app engine error of some kind?

    EDIT @ 9:17p PST: Below is an image of my App Engine traffic over the past 24 hours, with minimal code changes to the app. I've highlighted the 5xx spikes so you can see them more clearly.

    Traffic over the past 24 hours