nginx upstream timed out. Multiple servers at the same time

nginx php-fpm timeout

5,743

This is an educated guess. The problem could be caused by exhaustion of local TCP ports for connections to the upstream servers.

You can check the range of allowed ports with:

sysctl net.ipv4.ip_local_port_range

The default on my Debian installation is 32768 - 61000.
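To put that range into perspective, here is a minimal sketch that turns it into a number of usable local ports. The helper name port_range_size is made up for this illustration, and the fallback values are simply the Debian default quoted above, used only when the /proc file is not readable.

```shell
#!/bin/sh
# Hypothetical helper: number of ports in an inclusive range (high - low + 1).
port_range_size() {
    echo $(( $2 - $1 + 1 ))
}

# Read the live range on Linux; otherwise fall back to the default quoted above.
range_file=/proc/sys/net/ipv4/ip_local_port_range
if [ -r "$range_file" ]; then
    read -r low high < "$range_file"
else
    low=32768
    high=61000
fi

echo "local ports available: $(port_range_size "$low" "$high")"
```

For the range 32768 to 61000 this gives 28233 ports; for 1024 to 65535 it gives 64512.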

You can expand the range by entering the following command as root:

echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range

If you are running a Debian or derived distribution, you can persist this setting across reboots by editing /etc/sysctl.d/99-local.conf and entering this into the file:

net.ipv4.ip_local_port_range = 1024 65535
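To check whether the range is actually close to exhaustion, it may help to count the sockets towards the upstream port by TCP state, since sockets in TIME-WAIT still occupy a local port. The following is only a sketch: count_upstream_states is a hypothetical helper name, it assumes the column layout printed by `ss -tan` from iproute2 (state in column 1, peer address in column 5), and 9000 is the php-fpm port from the question.

```shell
#!/bin/sh
# Hypothetical helper: read `ss -tan` output on stdin and count sockets
# per TCP state whose peer port equals $1.
count_upstream_states() {
    awk -v port="$1" '
        # Skip the header line; match peers ending in ":<port>".
        NR > 1 && $5 ~ (":" port "$") { n[$1]++ }
        END { for (s in n) print s, n[s] }
    ' | sort
}

# Usage on the nginx host (assumes ss is installed):
#   ss -tan | count_upstream_states 9000
```

If the totals approach the size of the local port range during one of the 10 to 15 second episodes, port exhaustion becomes a more plausible explanation; if they stay far below it, the cause is probably elsewhere.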

Grumpy

Updated on September 18, 2022

Comments

Grumpy almost 2 years

I have several servers serving a single site.

The main server runs nginx and php-fpm; all the other servers run only php-fpm. nginx connects to the php-fpm on the main server via a Unix socket and to the others via TCP.

Roughly once an hour (not exactly, sometimes more frequently), something strange happens: all connections from nginx to the php-fpm servers time out. nginx fails to make a connection at all.

2014/03/24 04:59:09 [error] 2123#0: *925153 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.5:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *926742 connect() to unix:/tmp/php-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://unix:/tmp/php-fpm.sock:", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *925159 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.2:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *923874 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.3:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *925164 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.4:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *909392 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.3:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *923098 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.5:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2125#0: *923309 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.4:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"

As this is a fairly busy site, the log fills up with entries like the ones above quite fast.

This lasts for roughly 10 to 15 seconds, and then everything goes back to normal. Besides the connection errors posted here, there don't seem to be any other errors.

I suspect the problem lies with nginx since it happens simultaneously across all the php-fpm servers.

What would cause this? And how could this be resolved?

My nginx config is...

user  nginx;
worker_processes  4;
worker_rlimit_nofile 30000;

error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
    worker_connections  4096;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;

    keepalive_timeout  5;
    fastcgi_buffers 256 4k;
    gzip on;
    gzip_disable     "msie6";

    fastcgi_cache_path /dev/shm/caches/  levels=1:2 keys_zone=zoneone:4000m max_size=4000m inactive=30m;

    fastcgi_temp_path /var/www/tmp 1 2;
    fastcgi_cache_key "$scheme$proxy_host$request_uri";

    fastcgi_connect_timeout 3s;
    limit_req_zone  $binary_remote_addr  zone=limitone:200m   rate=1r/s;
    limit_req_zone  $binary_remote_addr  zone=limitcomic:500m   rate=40r/m;

    upstream partone {
        server unix:/tmp/php-fpm.sock;
    }

    upstream parttwo {
        server 192.168.1.3:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.4:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.5:9000 weight=10 max_fails=0 fail_timeout=2s;
    }

    upstream parttre {
        server 192.168.1.2:9000 weight=8 max_fails=0 fail_timeout=2s;
        server 192.168.1.3:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.4:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.5:9000 weight=10 max_fails=0 fail_timeout=2s;
    }
... stuff with server, locations and such...
}

You can see that I don't even use all 5 servers in the same context.

nginx version: nginx/1.4.5

Grumpy over 10 years

I think you may be right. I do have quite a large number of open ports (roughly two thirds of the limit). Though I haven't run out of them at this very moment (since it doesn't happen all the time), I have increased my port range, and I'll see if that resolves the problem.

Grumpy over 10 years

I've increased the local port range and reduced the total number of connections required to the main server, but the problem still seems to be occurring. So I don't think this is the cause.