Apache web server intermittent stalls
I've now resolved this so I'm posting the solution in case others experience the same issue.
I neglected to mention that all of our web traffic goes over HTTPS, and that appears to be the cause. During a stall I used strace and pstack to see what one of the idle Apache processes was doing. It was stuck waiting on a mutex for the SSL session cache.
Looking at the Apache config, I noticed we had SSLSessionCache enabled with a timeout of 5 minutes. Disabling it fixed the problem.
My guess is that the session cache was filling up, then Apache was waiting for older sessions to time out before continuing.
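For anyone wanting to check whether their own configuration has the session cache switched on, here is a minimal Python sketch that lists the relevant mod_ssl directives (`SSLSessionCache` and `SSLSessionCacheTimeout`) found in a config file. The sample config text and the cache path in it are made up for illustration and are not taken from my server.

```python
def find_session_cache_directives(config_text):
    """Return (line_number, directive, arguments) for each SSL session cache directive."""
    # Apache directive names are case-insensitive, so compare in lower case.
    wanted = {"sslsessioncache", "sslsessioncachetimeout"}
    found = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        name = parts[0]
        if name.lower() in wanted:
            args = parts[1].strip() if len(parts) > 1 else ""
            found.append((lineno, name, args))
    return found


# Hypothetical ssl.conf excerpt, for illustration only.
sample = """\
# hypothetical ssl.conf excerpt
SSLSessionCache shmcb:/var/cache/mod_ssl/scache(512000)
SSLSessionCacheTimeout 300
SSLEngine on
"""

for entry in find_session_cache_directives(sample):
    print(entry)
```

In mod_ssl, setting `SSLSessionCache none` turns the cache off. Whether that is the right trade-off for your traffic is something to test, since it means clients can no longer resume sessions from the cache.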
Mark Williams
Updated on September 18, 2022
Comments
-
Mark Williams over 1 year
Our SOAP web server is running in PHP on Apache on CentOS and makes heavy use of MySQL. There is heavy demand on the server - most requests are very small and involve only two or three MySQL queries, but there are an awful lot of them - potentially a couple of hundred per second at peak times. Data traffic with each request is usually less than 1 KB, often only a few bytes.
The hardware this is running on is pretty decent, 18 cores with 32 GB of RAM, and it generally copes really well. CPU usage never really goes above 30%, physical RAM consumption never above 50%. However, every so often, the server appears to stall and Apache chokes up. This can last for around a minute before it loosens up again and normal service resumes.
I've analysed this in quite some depth to see what is going on during the stalls. Apache is maxed out on its connections, pretty much all of which are in the 'reading' state. CPU usage drops to pretty much nothing, memory usage doesn't change, and network and disk IO both plummet, so it looks like the system is just completely idle.
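As a rough illustration of how the 'reading' count can be tracked over time, the Python sketch below tallies worker states from the `Scoreboard:` line of Apache's machine-readable status output. It assumes mod_status is enabled and that you have fetched the `?auto` variant of the status page yourself. The sample text is invented.

```python
from collections import Counter

# Scoreboard key as documented for mod_status.
STATE_NAMES = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup",
    ".": "open slot",
}


def count_worker_states(status_text):
    """Count workers per state from the text of a server-status?auto response."""
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            return Counter(STATE_NAMES.get(ch, "unknown") for ch in board)
    return Counter()  # no Scoreboard line found


# Invented sample, not real output from the server described here.
sample_status = "BusyWorkers: 5\nIdleWorkers: 1\nScoreboard: RRRRW_..\n"

for state, n in count_worker_states(sample_status).most_common():
    print(f"{n:4d}  {state}")
```

Logging these counts every second or so would show whether the 'reading' workers pile up gradually or all at once when a stall begins.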
After doing a lot of Googling, I was led to believe this could be to do with some timeout settings - network connections not being freed up quickly enough, and Apache running out. This would explain why Apache resumes normal operation after a while: it waits for them all to time out, then carries on. Doing a 'netstat -an' would support this, as I do see a lot of connections in TIME_WAIT. However, I've reduced all sorts of timeout settings in the Apache configuration, and also various net settings in sysctl.conf, but nothing appears to resolve the issue.
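To put numbers on the TIME_WAIT observation, a small Python sketch like the following can count TCP connections per state from saved 'netstat -an' output. It assumes the Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The sample lines use documentation-range addresses and are not real captures.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state from Linux `netstat -an` text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows start with "tcp" or "tcp6" and carry the state in column 6.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Invented sample output, for illustration only.
sample_netstat = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN
tcp        0      0 192.0.2.10:443          198.51.100.7:51812      TIME_WAIT
tcp        0      0 192.0.2.10:443          198.51.100.8:40022      TIME_WAIT
tcp        0      0 192.0.2.10:443          198.51.100.9:33110      ESTABLISHED
udp        0      0 0.0.0.0:68              0.0.0.0:*
"""

for state, n in count_tcp_states(sample_netstat).most_common():
    print(f"{n:6d}  {state}")
```

Running this against snapshots taken before, during and after a stall makes it easier to see which states actually change.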
There is nothing at all in Apache's error logs. I've tried using 'ab' to stress test Apache - it appears to cause the intermittent stall to happen slightly sooner, but that's all I can really gauge from it. The max connections for Apache and MySQL are both set to high values - actual concurrent connections never come close except during the stall when the Apache connections max out.
I'm not really sure what else to try. Any ideas or pointers on things I might be missing here?
--edit--
A couple of extra observations. As the stall is occurring, I notice the number of connections in the ESTABLISHED state rise considerably, then the number in CLOSE_WAIT follows a few seconds later.
Also, when the stall occurs, the counts of 'times the listen queue of a socket overflowed' and 'SYNs to LISTEN sockets ignored' increase quite rapidly. During the intervals between stalls, these numbers do not change at all.
I'm not sure whether these numbers are a cause or a consequence of the stall. Any further help would be much appreciated.
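The two counters mentioned above can also be sampled programmatically. On Linux, the 'netstat -s' lines for listen queue overflows and ignored SYNs are, as far as I know, backed by the `ListenOverflows` and `ListenDrops` fields of the `TcpExt` section in /proc/net/netstat. The Python sketch below parses two snapshots of that file and reports the difference. The snapshot text is invented and trimmed to a few fields.

```python
def parse_tcpext(netstat_text):
    """Parse the TcpExt header/value line pair from /proc/net/netstat text."""
    rows = [l.split() for l in netstat_text.splitlines() if l.startswith("TcpExt:")]
    if len(rows) < 2:
        return {}
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def listen_queue_delta(before_text, after_text):
    """Return how much ListenOverflows and ListenDrops grew between snapshots."""
    before, after = parse_tcpext(before_text), parse_tcpext(after_text)
    return {
        key: after.get(key, 0) - before.get(key, 0)
        for key in ("ListenOverflows", "ListenDrops")
    }


# Invented snapshots, for illustration only.
snapshot_before = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n"
    "TcpExt: 0 120 125\n"
    "IpExt: InOctets OutOctets\n"
    "IpExt: 1000 2000\n"
)
snapshot_after = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n"
    "TcpExt: 0 180 190\n"
    "IpExt: InOctets OutOctets\n"
    "IpExt: 1500 2600\n"
)

print(listen_queue_delta(snapshot_before, snapshot_after))
```

On a live system the snapshots would come from reading /proc/net/netstat a few seconds apart. A non-zero delta only during stalls would match what I saw, though on its own it does not say whether the overflow is cause or effect.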
-
Marcelo almost 10 years
A performance drop in the DB engine would cause an increase in connections on the Apache side, as they would last longer. Have you checked the MySQL logs to see if the DB is causing the wait? (A record lock, perhaps?)
-
Mark Williams almost 10 years
Doing 'SHOW PROCESSLIST' in MySQL shows an empty list during the stall. MySQL CPU usage drops from 20% to 0% during the stall. I've also tried pointing 'ab' at an empty PHP file that does no MySQL stuff, and that can cause the issue. Strangely, it copes better pointing it at an empty HTML file, so I wonder if this is something to do with PHP?
-
Mark Williams almost 10 years
If the web server was under attack, would I not notice an increase in network traffic? If I restart Apache right at the start of one of the stalls, it picks up and works as normal - almost feels like periodically restarting Apache could be a fix ;-).
-
Mark Williams almost 10 years
Thanks Ron, my Keep-Alive Timeout is currently set to 2, so I don't think that is the issue. When the stall occurs, it can usually last for up to a minute.
-
Ron Lawhorn almost 10 years
Well, we can assume it's not the concurrent connections hanging...
-
Ron Lawhorn almost 10 years
For the moment, let's assume it's not concurrent connections hanging. Have you tried running an ipstate when your server hangs to see what's happening on your external interface?