Varnish 3.0.2 to Apache2 sometimes return error 503
I was running into this with Apache, and the solution was a combination of the following (note that I'm using Apache/2.4.7 (Ubuntu) + varnish 3.0.5-2 on Ubuntu 14.04 LTS in AWS EC2):
Please keep in mind that this was made for an M3.Medium instance on Amazon EC2 (1x Intel Xeon E5-2670 core + 3.75GB RAM). Adjust as necessary for your hardware!
-
In
/etc/default/varnish
, edit your start-up options:DAEMON_OPTS="-a :80 \ -T localhost:6082 \ -f /etc/varnish/default.vcl \ -S /etc/varnish/secret \ -p thread_pools=2 \ -p thread_pool_max=600 \ -p listen_depth=1024 \ -p lru_interval=900 \ -p connect_timeout=600 \ -p max_restarts=6 \ -s malloc,1G"
-
In
/etc/varnish/default.vcl
or whatever your VCL is, change the back-end timeouts (note that we're also setting these in /etc/default/varnish):backend default { .host = "127.0.0.1"; .port = "8000"; .connect_timeout = 600s; .first_byte_timeout = 600s; .between_bytes_timeout = 600s; }
Disable KeepAlives. This page has more information (varies depending on back-end web server software): http://www.feedthebot.com/pagespeed/keep-alive.html
For Apache, all I had to do was change line 92 in /etc/apache2/apache2.conf
to the following:
KeepAlive Off
What I think is going on here is that the KeepAlives, as implemented in the back-end web server software, are sending explicit connection resets, which Varnish doesn't work well with. There is probably more to this story, and I encourage you to dig into this and post your findings here for future generations to learn from.
Additional reading: - https://www.varnish-cache.org/trac/wiki/Future_Feature#Keepalivetimeoutonbackendconnections ( and a few more, but can't post the links. Some Googling for "varnish keepalive backend timeout" should surface what you want)
More debugging help:
If you're still stuck, try doing the following:
- start varnishlog -w err.log
on your Varnish server
- On your client, get Siege: http://www.joedog.org/siege-home/ and load it up with some of the URLs you've seen 503 (hint: urls.txt, use -i -b -c500 -r10
and it should be enough to trigger the 503s)
- start varnishlog -r temp -c -m 'TxStatus:503' > err-parsed.txt
. This will grab all the Varnish log entries where Varnish returned a 503. FWIW, here's the full text of one of my errors. TL;DR the error Varnish was reporting was FetchError c http first read error: -1 0 (Success)
:
936 SessionOpen c 10.8.226.98 51895 :80
936 ReqStart c 10.8.226.98 51895 357447130
936 RxRequest c GET
936 RxURL c /ip/69.120.68.54
936 RxProtocol c HTTP/1.1
936 RxHeader c Host: 10.201.81.157
936 RxHeader c Accept: */*
936 RxHeader c Accept-Encoding: gzip
936 RxHeader c User-Agent: Mozilla/5.0 (apple-x86_64-darwin11.4.2) Siege/3.0.5
936 RxHeader c Connection: close
936 VCL_call c recv lookup
936 VCL_call c hash
936 Hash c /ip/69.120.68.54
936 Hash c 10.201.81.157
936 VCL_return c hash
936 HitPass c 357445183
936 VCL_call c pass pass
936 Backend c 103 default default
936 FetchError c http first read error: -1 0 (Success)
936 Backend c 269 default default
936 FetchError c http first read error: -1 0 (Success)
936 VCL_call c error deliver
936 VCL_call c deliver deliver
936 TxProtocol c HTTP/1.1
936 TxStatus c 503
936 TxResponse c Service Unavailable
936 TxHeader c Server: Varnish
936 TxHeader c Content-Type: text/html; charset=utf-8
936 TxHeader c Retry-After: 5
936 TxHeader c Content-Length: 418
936 TxHeader c Accept-Ranges: bytes
936 TxHeader c Date: Thu, 05 Jun 2014 23:05:48 GMT
936 TxHeader c X-Varnish: 357447130
936 TxHeader c Age: 0
936 TxHeader c Via: 1.1 varnish
936 TxHeader c Connection: close
936 Length c 418
Hope this helps!
Related videos on Youtube
Ronnie Jespersen
Updated on September 18, 2022Comments
-
Ronnie Jespersen over 1 year
Hey guys I hope you can help me out here.
I have an Ngingx parsing http and https to a varnish cache(3.0.2). From the varnish it is sent to apache2. Now I have for some time been tracking some strange 503 errors. But I cant seem to find the silver bullet.
Currently I am logging the 503 errors through varnish this way:
sudo varnishlog -c -m TxStatus:503 >> /home/rj/varnishlog503.log
and then referring to the apache access log to see if any 503 requests have been handled.
Today I had a health check from the firewall that failed:
20 SessionOpen c 127.0.0.1 34319 :8081 20 ReqStart c 127.0.0.1 34319 607335635 20 RxRequest c HEAD 20 RxURL c /health-check 20 RxProtocol c HTTP/1.0 20 RxHeader c X-Real-IP: 192.168.3.254 20 RxHeader c Host: 192.168.3.189 20 RxHeader c X-Forwarded-For: 192.168.3.254 20 RxHeader c Connection: close 20 RxHeader c User-Agent: Astaro Service Monitor 0.9 20 RxHeader c Accept: */* 20 VCL_call c recv lookup 20 VCL_call c hash 20 Hash c /health-check 20 VCL_return c hash 20 VCL_call c miss fetch 20 Backend c 33 aurum aurum 20 FetchError c http first read error: -1 11 (No error recorded) 20 VCL_call c error deliver 20 VCL_call c deliver deliver 20 TxProtocol c HTTP/1.1 20 TxStatus c 503 20 TxResponse c Service Unavailable 20 TxHeader c Server: Varnish 20 TxHeader c Content-Type: text/html; charset=utf-8 20 TxHeader c Retry-After: 5 20 TxHeader c Content-Length: 879 20 TxHeader c Accept-Ranges: bytes 20 TxHeader c Date: Wed, 06 Jun 2012 12:35:12 GMT 20 TxHeader c X-Varnish: 607335635 20 TxHeader c Age: 60 20 TxHeader c Via: 1.1 varnish 20 TxHeader c Connection: close 20 Length c 879 20 ReqEnd c 607335635 1338986052.649786949 1338986112.648169994 0.000160217 59.997980356 0.000402689
Now the backend server (apache) does not have any 503 error in the access log at this point. So I am confused. Is this varnish throwing a 503 because it thinks apache is to slow? There is a lot traffic coming through at this point so I know the server is up and running.
I do have other 503 error codes with posts and gets so there is really no pattern. It seems to be at random times and random requests. Even in the morning when the server dosen't seem to be doing anything.
I do see another pattern in the log:
4 VCL_call c recv pass 4 VCL_call c hash 4 Hash c /?id=412 4 VCL_return c hash 4 VCL_call c pass pass 4 FetchError c no backend connection 4 VCL_call c error deliver 4 VCL_call c deliver deliver
Here fetcherror says "no backend connection". A summery of the FetchErrors in todays log:
16 FetchError c http first read error: -1 11 (No error recorded) 5 FetchError c http first read error: -1 11 (No error recorded) 4 FetchError c http first read error: -1 11 (No error recorded) 19 FetchError c http first read error: -1 11 (No error recorded) 5 FetchError c http first read error: -1 11 (No error recorded) 23 FetchError c http first read error: -1 11 (No error recorded) 24 FetchError c http first read error: -1 11 (No error recorded) 16 FetchError c http first read error: -1 11 (No error recorded) 6 FetchError c http first read error: -1 11 (No error recorded) 4 FetchError c http first read error: -1 11 (No error recorded) 5 FetchError c http first read error: -1 11 (No error recorded) 4 FetchError c http first read error: -1 11 (No error recorded) 4 FetchError c http first read error: -1 11 (No error recorded) 22 FetchError c http first read error: -1 11 (No error recorded) 6 FetchError c http first read error: -1 11 (No error recorded) 21 FetchError c http first read error: -1 11 (No error recorded) 26 FetchError c no backend connection 4 FetchError c no backend connection 20 FetchError c http first read error: -1 11 (No error recorded) 39 FetchError c http first read error: -1 11 (No error recorded)
I haven't changed the default timeout values for varnish. This is my configuration for one of the backend servers.
backend xenon { .host = "192.168.3.187"; .port = "80"; .probe = { .url = "/health-check/"; .interval = 3s; .window = 5; .threshold = 2; } }
I'm running prefork module on apache2 with this configuration
<IfModule mpm_prefork_module> StartServers 1 MinSpareServers 2 MaxSpareServers 5 MaxClients 200 MaxRequestsPerChild 75 </IfModule>
and only PHP files is sent to the server. Every other static file is handled by Nginx.
Any ideas?
------- EDIT --------------
Some more debuging information
I have run a varnishadm debug.health
Backend radon is Healthy Current states good: 5 threshold: 2 window: 5 Average responsetime of good probes: 0.002560 Oldest Newest ================================================================ 4444444444444444444444444444444444444444444444444444444444444444 Good IPv4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Good Xmit RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR Good Recv HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH Happy Backend xenon is Healthy Current states good: 5 threshold: 2 window: 5 Average responsetime of good probes: 0.002760 Oldest Newest ================================================================ 4444444444444444444444444444444444444444444444444444444444444444 Good IPv4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Good Xmit RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR Good Recv HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH Happy Backend iridium is Healthy Current states good: 5 threshold: 2 window: 5 Average responsetime of good probes: 0.000849 Oldest Newest ================================================================ 4444444444444444444444444444444444444444444444444444444444444444 Good IPv4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Good Xmit RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR Good Recv HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH Happy Backend aurum is Healthy Current states good: 5 threshold: 2 window: 5 Average responsetime of good probes: 0.002100 Oldest Newest ================================================================ 4444444444444444444444444444444444444444444444444444444444444444 Good IPv4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Good Xmit RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR Good Recv HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH Happy
And I have been monitoring varnishstat from the two load balancers
3224774 3.99 2.61 backend_conn - Backend conn. success 27 0.00 0.00 backend_unhealthy - Backend conn. not attempted 63 0.00 0.00 backend_fail - Backend conn. failures 358798 0.00 0.29 backend_reuse - Backend conn. reuses 21035 0.00 0.02 backend_toolate - Backend conn. was closed 379834 0.00 0.31 backend_recycle - Backend conn. recycles 26 0.00 0.00 backend_retry - Backend conn. retry 3217751 5.99 2.61 backend_conn - Backend conn. success 32 0.00 0.00 backend_fail - Backend conn. failures 364185 0.00 0.30 backend_reuse - Backend conn. reuses 27077 0.00 0.02 backend_toolate - Backend conn. was closed 391263 0.00 0.32 backend_recycle - Backend conn. recycles 36 0.00 0.00 backend_retry - Backend conn. retry
Notice that none of them have reported backend_fail.
/Ronnie
-
Aaron Hudon almost 7 yearsIf the Apache server rejects connections for example because the syn queue overflows you won't see that in the Apache logfiles. It does show up in the
netstat -s
counters but it will be hard to correlate.
-
-
Ronnie Jespersen almost 12 yearsHey KM01. Yes I didn't mean error :) The access log contains all requests and I was looking for those with the http status code of 503. I cant find any which makes me believe that the requests never get to the apache server but stops in varnish. The current apache configuration is still good considering that apachge don't server any static files.
-
Jack ilkgu almost 12 yearsYes, u'r right, I've to remember the apache logs 503s to access_log even though it is an error (-: but we digress ... So, the
No backend connection
is the part that makes me think that you should have a second look at your prefork settings. Looks like there isn't an apache process available to fulfill the incoming request. -
Ronnie Jespersen almost 12 yearsCould you then explain to me how apache works then? Cause I though that apache would spawn new processes to handle the requests. Up to the default value of 256 simultaneous requests. Shouldn't it??
-
Ronnie Jespersen almost 12 yearsBy the way I have changed the values to on the 2 backend servers so lets see what happens today :)
-
Ronnie Jespersen almost 12 yearsI'm afraid it haven't solved the issue. I still receive 503 error in my varnish log and not in my apache access log even though I have more than doubled my values in the apache configuration.
-
Jack ilkgu almost 12 yearsBut you've eliminated apache (-: most likely, so now, let us focus on Varnish. Run a
debug.health
from the varnishadm CLI? I think avarnishlog -i Debug_health
will work as well. -
Ronnie Jespersen almost 12 yearsIf have the data now but I guess its really not helpful unless a catch one of the backends being down? All servers report healthy and have: Average responsetime of good probes: 0.002100
-
Ronnie Jespersen almost 12 yearsI have added some of the debuing data in the first thread.
-
Jack ilkgu almost 12 yearsI'm afraid that doesn't say much )-: You'll have to try to grab data when the 503 occurs. You can also try posting your question on the varnish mailing list varnish-cache.org/community/mailing-lists or the discussion forums: varnish-cache.org/forum/1 ... sorry couldn't be of more help )-:
-
Ronnie Jespersen almost 12 yearsOh my sorry... I just found out that the 503 indeed exists in the access log. It seems more and more like an apache issue.
-
mostafaznv over 5 years
.connect_timeout = 600s;
solved my problem