NGINX timeout after 200+ concurrent connections
Solution 1
You will need to dump your network connections during the test. While the server may show near-zero load, your TCP/IP stack could be filling up. Look for TIME_WAIT connections in the netstat output.
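For example, a quick way to tally connections by TCP state while the test is running (assuming the classic net-tools netstat; ss -s gives a similar summary on newer systems):

# Count connections by state; a large TIME_WAIT pile-up points at the TCP stack
netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn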
If that is the case, then you will want to look into tuning the TCP/IP kernel parameters relating to TCP wait states, TCP recycling, and similar settings.
Also, you have not described what is being tested.
I always test:
- static content (image or text file)
- simple php page (phpinfo for example)
- application page
This may not apply in your case, but it is something I do when performance testing. Testing different types of files can help you pinpoint the bottleneck.
Even with static content, testing different sizes of files is important as well, to get timeouts and other metrics dialed in.
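For instance, a handful of throwaway files of increasing size can be generated with dd (the paths and sizes here are just examples; adjust to your web root):

# Create 1 KB, 100 KB and 10 MB test files to serve as static content
dd if=/dev/zero of=/var/www/test-1k.bin bs=1K count=1
dd if=/dev/zero of=/var/www/test-100k.bin bs=1K count=100
dd if=/dev/zero of=/var/www/test-10m.bin bs=1M count=10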
We have some static-content Nginx boxes handling 3000+ active connections, so Nginx can certainly do it.
Update: Your netstat output shows a lot of open connections. You may want to try tuning your TCP/IP stack. Also, what file are you requesting? Nginx should close the port fairly quickly.
Here is a suggestion for sysctl.conf:
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
These values are very low, but I have had success with them on high-concurrency Nginx boxes.
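Once added to /etc/sysctl.conf, the values can be applied without a reboot, for example:

# Reload all settings from /etc/sysctl.conf and echo what was applied
sysctl -p
# Spot-check a single value afterwards
sysctl net.ipv4.tcp_fin_timeout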
Solution 2
I was having a very similar issue with an nginx box serving as a load balancer in front of an upstream of Apache servers.
In my case I was able to isolate the problem as network-related: the upstream Apache servers were becoming overloaded. I could recreate the issue with simple bash scripts while the overall system was under load. According to an strace of one of the hung processes, the connect() call was returning ETIMEDOUT.
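For reference, this is roughly how such a hung process can be traced (the PID here is a placeholder for whatever process is stuck):

# Attach to the hung process and watch only network-related syscalls;
# a connect() returning ETIMEDOUT confirms the handshake never completes
strace -p 12345 -e trace=network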
These settings (on the nginx and upstream servers) eliminated the problem for me. I was getting 1 or 2 timeouts per minute before making these changes (with the boxes handling ~100 req/s) and now get 0.
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 20480
net.core.netdev_max_backlog = 4096
net.ipv4.tcp_max_tw_buckets = 400000
net.core.somaxconn = 4096
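To confirm whether the listen backlog is actually overflowing, the kernel's cumulative counters can be compared before and after a test run, for example:

# Look for "SYNs to LISTEN sockets dropped" and "listen queue ... overflowed"
netstat -s | grep -i listen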
I would not recommend using net.ipv4.tcp_tw_recycle or net.ipv4.tcp_tw_reuse, but if you want to use one, go with the latter. They can cause bizarre issues if there is any kind of latency at all, and tcp_tw_reuse is at least the safer of the two.
I also think having tcp_fin_timeout set to 1 (as in your update above) may be causing some trouble. Try putting it at 20 or 30, which is still far below the default of 60.
Solution 3
Maybe it is not an nginx problem. While you run the test on blitz.io, do a:
tail -f /var/log/php5-fpm.log
(that's what I am using to handle PHP)
Under load this triggers an error, and the timeouts start to rise:
WARNING: [pool www] server reached pm.max_children setting (5), consider raising it
So, raise pm.max_children in the FPM config and it's done! ;D
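A minimal sketch of that change, assuming a Debian-style pool file at /etc/php5/fpm/pool.d/www.conf (the path and numbers will vary per setup):

; /etc/php5/fpm/pool.d/www.conf (path assumed)
pm = dynamic
pm.max_children = 50        ; raised from the 5 reported in the warning above
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 20

Reload PHP-FPM afterwards (e.g. service php5-fpm restart) so the new limits take effect.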
Solution 4
Yet another hypothesis: you have increased worker_rlimit_nofile, but the maximum number of clients is defined in the documentation as

max_clients = worker_processes * worker_connections

What if you try raising worker_connections to, say, 8192? Or, if there are enough CPU cores, increase worker_processes?
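A sketch of what that would look like in nginx.conf (the numbers are illustrative, not a recommendation):

worker_processes 4;            # roughly one per CPU core
worker_rlimit_nofile 16384;    # keep this above worker_connections

events {
    worker_connections 8192;   # 4 workers * 8192 = 32768 theoretical max clients
}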
Gajus
Updated on September 18, 2022

Comments
-
Gajus over 1 year
This is my nginx.conf (I've updated the config to make sure that there is no PHP involved or any other bottlenecks):

user nginx;
worker_processes 4;
worker_rlimit_nofile 10240;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    error_log /var/www/log/nginx_errors.log warn;
    port_in_redirect off;
    server_tokens off;
    sendfile on;
    gzip on;
    client_max_body_size 200M;

    map $scheme $php_https {
        default off;
        https on;
    }

    index index.php;

    client_body_timeout 60;
    client_header_timeout 60;
    keepalive_timeout 60 60;
    send_timeout 60;

    server {
        server_name dev.anuary.com;
        root "/var/www/virtualhosts/dev.anuary.com";
    }
}
I am using http://blitz.io/play to test my server (I bought the 10,000 concurrent connections plan). In a 30-second run, I get 964 hits and 5,587 timeouts. The first timeout happened at 40.77 seconds into the test, when the number of concurrent users was at 200. During the test, the server load was (top output):

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
20225 nginx  20  0 48140 6248 1672 S 16.0  0.0 0:21.68 nginx
    1 root   20  0 19112 1444 1180 S  0.0  0.0 0:02.37 init
    2 root   20  0     0    0    0 S  0.0  0.0 0:00.00 kthreadd
    3 root   RT  0     0    0    0 S  0.0  0.0 0:00.03 migration/0

Therefore it is not a server resource issue. What is it, then?
UPDATE 2011 12 09 GMT 17:36.
So far I have made the following changes to make sure that the bottleneck is not TCP/IP. Added to /etc/sysctl.conf:

# These ensure that TIME_WAIT ports either get reused or closed fast.
net.ipv4.tcp_fin_timeout = 1
net.ipv4.tcp_tw_recycle = 1
# TCP memory
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.netdev_max_backlog = 262144
net.core.somaxconn = 4096
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
Some more debug info:

[root@server node]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 126767
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
NB: worker_rlimit_nofile is set to 10240 in the nginx config.

UPDATE 2011 12 09 GMT 19:02.
It looks like the more changes I make, the worse it gets, but here is the new config file.

user nginx;
worker_processes 4;
worker_rlimit_nofile 10240;
pid /var/run/nginx.pid;

events {
    worker_connections 2048;
    # 1,353 hits, 2,751 timeouts, 72 errors - Bummer. Try again?
    # 1,408 hits, 2,727 timeouts - Maybe you should increase the timeout?
}

http {
    include /etc/nginx/mime.types;
    error_log /var/www/log/nginx_errors.log warn;

    # http://blog.martinfjordvald.com/2011/04/optimizing-nginx-for-high-traffic-loads/
    access_log off;
    open_file_cache max=1000;
    open_file_cache_valid 30s;
    client_body_buffer_size 10M;
    client_max_body_size 200M;
    proxy_buffers 256 4k;
    fastcgi_buffers 256 4k;
    keepalive_timeout 15 15;
    client_body_timeout 60;
    client_header_timeout 60;
    send_timeout 60;
    port_in_redirect off;
    server_tokens off;
    sendfile on;
    gzip on;
    gzip_buffers 256 4k;
    gzip_comp_level 5;
    gzip_disable "msie6";

    map $scheme $php_https {
        default off;
        https on;
    }

    index index.php;

    server {
        server_name ~^www\.(?P<domain>.+);
        rewrite ^ $scheme://$domain$request_uri? permanent;
    }

    include /etc/nginx/conf.d/virtual.conf;
}
UPDATE 2011 12 11 GMT 20:11.
This is the output of netstat -ntla during the test: https://gist.github.com/d74750cceba4d08668ea
UPDATE 2011 12 12 GMT 10:54.
Just to clarify, iptables (the firewall) is off while testing.

UPDATE 2011 12 12 GMT 22:47.
This is the sysctl -p | grep mem dump:

net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_mem = 8388608 8388608 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.ipv4.route.flush = 1
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_max = 8388608
net.core.wmem_default = 65536
net.core.netdev_max_backlog = 262144
net.core.somaxconn = 4096
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
UPDATE 2011 12 12 GMT 22:49.
I am using blitz.io to run all the tests. The URL I am testing is http://dev.anuary.com/test.txt, using the following command:

--region ireland --pattern 200-250:30 -T 1000 http://dev.anuary.com/test.txt
UPDATE 2011 12 13 GMT 13:33.
nginx user limits (set in /etc/security/limits.conf):

nginx hard nofile 40000
nginx soft nofile 40000
-
Gajus over 12 years
Yes, this is my server: ovh.co.uk/dedicated_servers/eg_ssd.xml. Nothing that would ramp down a DDoS attack. I've also increased worker_processes to 4.
-
Gajus over 12 years
Just contacted OVH to double-check that there aren't any network-level security measures implemented on my server. There aren't.
-
pablo over 12 years
What kind of data are you serving from this? HTML, images, etc.?
-
Gajus over 12 years
@pablo, a simple txt file, dev.anuary.com/test.txt to be specific.
-
Tom O'Connor over 12 years
What happens if you turn keepalive off?
-
Gajus over 12 years
@TomO'Connor, no improvement, well, at least no more than 5%.
-
Tom O'Connor over 12 years
Hummph. That's one idea out, then.
-
jood over 12 years
Might be a stupid question, but what is ulimit -n for the user nginx is started as?
-
Gajus over 12 years
@minaev, see the updated answer (very bottom).
-
SiXoS over 12 years
I think it would help to run a local benchmark to rule out nginx configuration. Don't you?
-
jeffatrackaid over 12 years
Added an update to the main reply due to code.
-
Giovanni Toraldo over 12 years
Please add the complete top output during the test; you shouldn't check only how much CPU nginx is using.
-
Gajus about 12 years
The problem is the same if I have return 200 "test" in NGINX. This means that NGINX doesn't even go as far as to call PHP-FPM.
-
محمّد محسن احمدی about 11 years
Be cautious when using net.ipv4.tcp_tw_recycle = 1; generally speaking, it is not a good idea. tcp_tw_reuse is OK, though.
-
BigSack about 11 years
Why not use a Unix socket instead of localhost?
-
Ryan Angilly about 8 years
I'm not sure why this has been downvoted. Sounds like the right answer to me.