NGINX timeout after 200+ concurrent connections
Solution 1
You will need to dump your network connections during the test. While the server may show near-zero load, your TCP/IP stack could be filling up. Look for TIME_WAIT connections in the netstat output.
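For example, a quick way to tally connections by TCP state while the test is running (assuming the classic net-tools netstat; ss -s gives a similar summary on newer systems):

# Count connections by state; a large TIME_WAIT pile-up points at the TCP stack
netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn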
If that is the case, then you will want to look into tuning the TCP/IP kernel parameters relating to TCP wait states, TCP recycling, and similar settings.
Also, you have not described what is being tested.
I always test:
- static content (image or text file)
- simple php page (phpinfo for example)
- application page
This may not apply in your case, but it is something I do when performance testing. Testing different types of files can help you pinpoint the bottleneck.
Even with static content, testing different sizes of files is important as well, to get timeouts and other metrics dialed in.
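For instance, a handful of throwaway files of increasing size can be generated with dd (the paths and sizes here are just examples; adjust to your web root):

# Create 1 KB, 100 KB and 10 MB test files to serve as static content
dd if=/dev/zero of=/var/www/test-1k.bin bs=1K count=1
dd if=/dev/zero of=/var/www/test-100k.bin bs=1K count=100
dd if=/dev/zero of=/var/www/test-10m.bin bs=1M count=10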
We have some static-content Nginx boxes handling 3000+ active connections, so Nginx can certainly do it.
Update: Your netstat output shows a lot of open connections. You may want to try tuning your TCP/IP stack. Also, what file are you requesting? Nginx should close the port fairly quickly.
Here is a suggestion for sysctl.conf:
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
These values are very low, but I have had success with them on high-concurrency Nginx boxes.
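Once added to /etc/sysctl.conf, the values can be applied without a reboot, for example:

# Reload all settings from /etc/sysctl.conf and echo what was applied
sysctl -p
# Spot-check a single value afterwards
sysctl net.ipv4.tcp_fin_timeout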
Solution 2
I was having a very similar issue with an nginx box serving as a load balancer in front of an upstream of Apache servers.
In my case I was able to isolate the problem as network-related: the upstream Apache servers were becoming overloaded. I could recreate the issue with simple bash scripts while the overall system was under load. According to an strace of one of the hung processes, the connect() call was returning ETIMEDOUT.
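For reference, this is roughly how such a hung process can be traced (the PID here is a placeholder for whatever process is stuck):

# Attach to the hung process and watch only network-related syscalls;
# a connect() returning ETIMEDOUT confirms the handshake never completes
strace -p 12345 -e trace=network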
These settings (on the nginx and upstream servers) eliminated the problem for me. I was getting 1 or 2 timeouts per minute before making these changes (with the boxes handling ~100 req/s) and now get 0.
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 20480
net.core.netdev_max_backlog = 4096
net.ipv4.tcp_max_tw_buckets = 400000
net.core.somaxconn = 4096
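To confirm whether the listen backlog is actually overflowing, the kernel's cumulative counters can be compared before and after a test run, for example:

# Look for "SYNs to LISTEN sockets dropped" and "listen queue ... overflowed"
netstat -s | grep -i listen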
I would not recommend using net.ipv4.tcp_tw_recycle or net.ipv4.tcp_tw_reuse, but if you want to use one, go with the latter. They can cause bizarre issues if there is any kind of latency at all, and tcp_tw_reuse is at least the safer of the two.
I also think having tcp_fin_timeout set to 1 (as in your update above) may be causing some trouble. Try putting it at 20 or 30, which is still far below the default of 60.
Solution 3
Maybe it is not an nginx problem. While you run the test on blitz.io, do a:
tail -f /var/log/php5-fpm.log
(that's what I am using to handle PHP)
Under load this triggers an error, and the timeouts start to rise:
WARNING: [pool www] server reached pm.max_children setting (5), consider raising it
So, raise pm.max_children in the FPM config and it's done! ;D
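A minimal sketch of that change, assuming a Debian-style pool file at /etc/php5/fpm/pool.d/www.conf (the path and numbers will vary per setup):

; /etc/php5/fpm/pool.d/www.conf (path assumed)
pm = dynamic
pm.max_children = 50        ; raised from the 5 reported in the warning above
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 20

Reload PHP-FPM afterwards (e.g. service php5-fpm restart) so the new limits take effect.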
Solution 4
Yet another hypothesis: you have increased worker_rlimit_nofile, but the maximum number of clients is defined in the documentation as

max_clients = worker_processes * worker_connections

What if you try raising worker_connections to, say, 8192? Or, if there are enough CPU cores, increase worker_processes?
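A sketch of what that would look like in nginx.conf (the numbers are illustrative, not a recommendation):

worker_processes 4;            # roughly one per CPU core
worker_rlimit_nofile 16384;    # keep this above worker_connections

events {
    worker_connections 8192;   # 4 workers * 8192 = 32768 theoretical max clients
}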
Gajus
Updated on September 18, 2022

Comments
-
Gajus over 1 year
This is my nginx.conf (I've updated the config to make sure that there is no PHP involved or any other bottlenecks):

user nginx;
worker_processes 4;
worker_rlimit_nofile 10240;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    error_log /var/www/log/nginx_errors.log warn;
    port_in_redirect off;
    server_tokens off;
    sendfile on;
    gzip on;
    client_max_body_size 200M;

    map $scheme $php_https {
        default off;
        https on;
    }

    index index.php;

    client_body_timeout 60;
    client_header_timeout 60;
    keepalive_timeout 60 60;
    send_timeout 60;

    server {
        server_name dev.anuary.com;
        root "/var/www/virtualhosts/dev.anuary.com";
    }
}
I am using http://blitz.io/play to test my server (I bought the 10,000 concurrent connections plan). In a 30-second run, I get 964 hits and 5,587 timeouts. The first timeout happened at 40.77 seconds into the test, when the number of concurrent users was at 200. During the test, the server load was (top output):

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
20225 nginx  20  0 48140 6248 1672 S 16.0  0.0 0:21.68 nginx
    1 root   20  0 19112 1444 1180 S  0.0  0.0 0:02.37 init
    2 root   20  0     0    0    0 S  0.0  0.0 0:00.00 kthreadd
    3 root   RT  0     0    0    0 S  0.0  0.0 0:00.03 migration/0

Therefore it is not a server resource issue. What is it, then?
UPDATE 2011 12 09 GMT 17:36.
So far I have made the following changes to make sure that the bottleneck is not TCP/IP. Added to /etc/sysctl.conf:

# These ensure that TIME_WAIT ports either get reused or closed fast.
net.ipv4.tcp_fin_timeout = 1
net.ipv4.tcp_tw_recycle = 1
# TCP memory
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.netdev_max_backlog = 262144
net.core.somaxconn = 4096
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
Some more debug info:

[root@server node]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 126767
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
NB: worker_rlimit_nofile is set to 10240 in the nginx config.

UPDATE 2011 12 09 GMT 19:02.
It looks like the more changes I make, the worse it gets, but here is the new config file.

user nginx;
worker_processes 4;
worker_rlimit_nofile 10240;
pid /var/run/nginx.pid;

events {
    worker_connections 2048;
    # 1,353 hits, 2,751 timeouts, 72 errors - Bummer. Try again?
    # 1,408 hits, 2,727 timeouts - Maybe you should increase the timeout?
}

http {
    include /etc/nginx/mime.types;
    error_log /var/www/log/nginx_errors.log warn;

    # http://blog.martinfjordvald.com/2011/04/optimizing-nginx-for-high-traffic-loads/
    access_log off;
    open_file_cache max=1000;
    open_file_cache_valid 30s;
    client_body_buffer_size 10M;
    client_max_body_size 200M;
    proxy_buffers 256 4k;
    fastcgi_buffers 256 4k;
    keepalive_timeout 15 15;
    client_body_timeout 60;
    client_header_timeout 60;
    send_timeout 60;
    port_in_redirect off;
    server_tokens off;
    sendfile on;
    gzip on;
    gzip_buffers 256 4k;
    gzip_comp_level 5;
    gzip_disable "msie6";

    map $scheme $php_https {
        default off;
        https on;
    }

    index index.php;

    server {
        server_name ~^www\.(?P<domain>.+);
        rewrite ^ $scheme://$domain$request_uri? permanent;
    }

    include /etc/nginx/conf.d/virtual.conf;
}
UPDATE 2011 12 11 GMT 20:11.
This is the output of netstat -ntla during the test: https://gist.github.com/d74750cceba4d08668ea
UPDATE 2011 12 12 GMT 10:54.
Just to clarify, iptables (the firewall) is off while testing.

UPDATE 2011 12 12 GMT 22:47.
This is the sysctl -p | grep mem dump:

net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_mem = 8388608 8388608 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.ipv4.route.flush = 1
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_max = 8388608
net.core.wmem_default = 65536
net.core.netdev_max_backlog = 262144
net.core.somaxconn = 4096
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
UPDATE 2011 12 12 GMT 22:49.
I am using blitz.io to run all the tests. The URL I am testing is http://dev.anuary.com/test.txt, using the following command:

--region ireland --pattern 200-250:30 -T 1000 http://dev.anuary.com/test.txt
UPDATE 2011 12 13 GMT 13:33.
nginx user limits (set in /etc/security/limits.conf):

nginx hard nofile 40000
nginx soft nofile 40000
-
Gajus over 12 years
Yes, this is my server: ovh.co.uk/dedicated_servers/eg_ssd.xml. Nothing that would ramp down a DDoS attack. I've also increased worker_processes to 4.
-
Gajus over 12 years
Just contacted OVH to double-check that there aren't any network-level security measures implemented on my server. There aren't.
-
pablo over 12 years
What kind of data are you serving from this? HTML, images, etc.?
-
Gajus over 12 years
@pablo, a simple txt file, dev.anuary.com/test.txt to be specific.
-
Tom O'Connor over 12 years
What happens if you turn keepalive off?
-
Gajus over 12 years
@TomO'Connor, no improvement, well, at least no more than 5%.
-
Tom O'Connor over 12 years
Hummph. That's one idea out, then.
-
jood over 12 years
Might be a stupid question, but what is ulimit -n for the user nginx is started as?
-
Gajus over 12 years
@minaev, see the updated answer (very bottom).
-
SiXoS over 12 years
I think it would help to run a local benchmark to rule out nginx configuration. Don't you?
-
jeffatrackaid over 12 years
Added an update to the main reply due to code.
-
Giovanni Toraldo over 12 years
Please add the complete top output during the test; you shouldn't check only how much CPU nginx is using.
-
Gajus about 12 years
The problem is the same if I have return 200 "test" in NGINX. This means that NGINX doesn't even go as far as to call PHP-FPM.
-
محمّد محسن احمدی about 11 years
Be cautious when using net.ipv4.tcp_tw_recycle = 1; generally speaking, it is not a good idea. tcp_tw_reuse is OK, though.
-
BigSack about 11 years
Why not use a Unix socket instead of localhost?
-
Ryan Angilly about 8 years
I'm not sure why this has been downvoted. Sounds like the right answer to me.