Nginx as a load balancer: frequent "upstream timed out (110: Connection timed out) while connecting to upstream"


I ran into something like this in an nginx<->apache2 setup. It turned out to be Apache taking too long under load because MySQL was bogging down. To find out how long Apache was taking, I changed the Apache log format to:

LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %DµSEC" timed
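
Here %D logs the time taken to serve the request in microseconds, which is what exposes how long Apache itself spends on each request. To actually write entries in this format, the timed nickname also has to be referenced from a CustomLog directive; a minimal example (log path assumed, not from the original setup) would be:

    # write the timed format to its own file (path assumed)
    CustomLog /var/log/httpd/access_timed.log timed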

On the nginx side, I changed the access log format to:

log_format timed_combined '$remote_addr - $remote_user [$time_local]  '
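
The log_format line above is cut off; the important part is to capture the timing variables. A full definition along these lines (the remaining fields are an assumption based on the common timed_combined pattern, not necessarily the exact fields used here) could be:

    log_format timed_combined '$remote_addr - $remote_user [$time_local]  '
        '"$request" $status $body_bytes_sent '
        '"$http_referer" "$http_user_agent" '
        '$request_time $upstream_response_time $pipe';

    access_log  /var/log/nginx/access_timed.log  timed_combined;   # log path assumed

Comparing $request_time (the total time nginx spent on the request) with $upstream_response_time shows immediately when the backend, rather than nginx or the client connection, is the slow party.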

It then became much easier to see that although Apache was completing every request, it was delivering the data back to nginx very late (by many seconds).

I'm not sure why HAProxy has helped your situation unless one Apache server is much slower than the others. That can happen even with identical machines when one of them is having recoverable disk errors; those errors should show up in the syslogs.

Comments

  • Jchieppa
    Jchieppa over 1 year

    I'm trying to use nginx on a CentOS 7 virtual machine as a load balancer to replace an aging Coyote Point hardware device. However, in one of our webapps we're seeing frequent and continual upstream timeout errors in the logs, and clients are reporting session issues while using the system.

    Here are the relevant bits from our nginx.conf:

    user  nginx;
    worker_processes  4;
    
    error_log  /var/log/nginx/error.log warn;
    pid        /var/run/nginx.pid;
    
    
    events {
        worker_connections  1024;
    }
    
    upstream farm {
       ip_hash;
    
       server www1.domain.com:8080;
       server www2.domain.com:8080 down;
       server www3.domain.com:8080;
       server www4.domain.com:8080;
    }
    
    server {
            listen 192.168.1.87:80;
            server_name host.domain.com;
            return         301 https://$server_name$request_uri;
    }
    
    server {
        listen 192.168.1.87:443 ssl;
        server_name host.domain.com;
    
        ## Compression
        gzip              on;
        gzip_buffers      16 8k;
        gzip_comp_level   4;
        gzip_http_version 1.0;
        gzip_min_length   1280;
        gzip_types        text/plain text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript image/x-icon image/bmp;
        gzip_vary         on;
    
        tcp_nodelay on;
        tcp_nopush on;
        sendfile off;
    
        location / {
            proxy_connect_timeout   10;
            proxy_send_timeout      180;
            proxy_read_timeout      180;    # to allow for large manager reports
            proxy_buffering         off;
            proxy_buffer_size       128k;
            proxy_buffers           4 256k;
            proxy_busy_buffers_size 256k;
            proxy_set_header        Host $host;
            proxy_set_header        X-Real-IP $remote_addr;
            proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_pass              http://farm;

            location ~* \.(css|jpg|gif|ico|js)$ {
                proxy_cache       mypms_cache;
                add_header        X-Proxy-Cache $upstream_cache_status;
                proxy_cache_valid 200 60m;
                expires           60m;
                proxy_pass        http://farm;
            }
        }

        location /basic_status {
            stub_status;
        }

        error_page 502 = /maintenance.html;
        location = /maintenance.html {
            root /www/;
        }
    }
    

    In the logs I frequently see entries like

    2015/03/13 15:22:58 [error] 4482#0: *557390 upstream timed out (110: Connection timed out) while connecting to upstream, client: 72.160.92.101, server: host.domain.com, request: "GET /tapechart.php HTTP/1.1", upstream: "http://192.168.1.50:8080/tapechart.php", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
    2015/03/13 15:23:14 [error] 4481#0: *557663 upstream timed out (110: Connection timed out) while connecting to upstream, client: 174.53.144.4, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=3105543&show=0 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=3105543&show=0", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=3105543&show=0"
    2015/03/13 15:23:19 [error] 4481#0: *557550 upstream timed out (110: Connection timed out) while connecting to upstream, client: 50.134.133.213, server: host.domain.com, request: "GET /tbltapechart.php?numNights=30&startDate=1-Aug-2015&roomTypeID=-1&hideNav=N&bookingID=&roomFilter=-1 HTTP/1.1", upstream: "http://192.168.1.50:8080/tbltapechart.php?numNights=30&startDate=1-Aug-2015&roomTypeID=-1&hideNav=N&bookingID=&roomFilter=-1", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
    2015/03/13 15:23:37 [error] 4483#0: *561705 upstream timed out (110: Connection timed out) while connecting to upstream, client: 74.223.167.14, server: host.domain.com, request: "GET /js/multiselect/jquery.multiselect.filter.css HTTP/1.1", upstream: "http://192.168.1.55:8080/js/multiselect/jquery.multiselect.filter.css", host: "host.domain.com", referrer: "https://host.domain.com/fdhome.php"
    2015/03/13 15:23:40 [error] 4481#0: *561099 upstream timed out (110: Connection timed out) while connecting to upstream, client: 74.223.167.14, server: host.domain.com, request: "GET /img/tabs_left_bc.jpg HTTP/1.1", upstream: "http://192.168.1.55:8080/img/tabs_left_bc.jpg", host: "host.domain.com", referrer: "https://host.domain.com/fdhome.php"
    2015/03/13 15:23:45 [error] 4481#0: *557214 upstream timed out (110: Connection timed out) while connecting to upstream, client: 75.37.141.182, server: host.domain.com, request: "GET /tapechart.php HTTP/1.1", upstream: "http://192.168.1.50:8080/tapechart.php", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
    2015/03/13 15:23:52 [error] 4482#0: *557330 upstream timed out (110: Connection timed out) while connecting to upstream, client: 173.164.149.18, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=658108460B&show=1&toFolioID=3361434 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=658108460B&show=1&toFolioID=3361434", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=658108460B&show=1&toFolioID=3361434"
    2015/03/13 15:24:14 [error] 4481#0: *557663 upstream timed out (110: Connection timed out) while connecting to upstream, client: 174.53.144.4, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=3105543&show=0 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=3105543&show=0", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=3105543&show=0"
    2015/03/13 15:24:15 [error] 4481#0: *557752 upstream timed out (110: Connection timed out) while connecting to upstream, client: 24.158.4.70, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=2070569 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=2070569", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
    2015/03/13 15:24:15 [error] 4482#0: *558613 upstream timed out (110: Connection timed out) while connecting to upstream, client: 199.102.121.3, server: host.domain.com, request: "GET /rptlanding.php HTTP/1.1", upstream: "http://192.168.1.50:8080/rptlanding.php", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
    2015/03/13 15:24:17 [error] 4482#0: *557353 upstream timed out (110: Connection timed out) while connecting to upstream, client: 174.53.144.4, server: host.domain.com, request: "GET /js/multiselect/demo/assets/prettify.js HTTP/1.1", upstream: "http://192.168.1.50:8080/js/multiselect/demo/assets/prettify.js", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=3186044"
    

    I initially found I had to set such a high proxy_read_timeout because we have one report that is very large and takes at least 20 seconds to fully render for users with a moderate dataset; for our users with the largest datasets it can take up to 2 minutes. However, that report is run very infrequently, usually less than once per day by users, and it has never been the URL in the GET requests shown in the timeout log entries.
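
    (The path below is hypothetical, since the report's URL isn't shown, but for context this is roughly how the long read timeout could be scoped to just that report instead of being applied to every proxied request:)

        # hypothetical location: give only the big report the long read timeout
        location = /bigreport.php {
            proxy_read_timeout 300;
            proxy_buffering    off;
            proxy_set_header   Host $host;
            proxy_set_header   X-Real-IP $remote_addr;
            proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_pass         http://farm;
        }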

    The four backend servers are identical Apache servers, all running httpd 2.2.29 and PHP 5.5.22 built from source, and all on the same, up-to-date version of CentOS. As I had originally seen MaxClients being hit in the logs, I defined the following on each Apache host:

    <IfModule mpm_prefork_module>
        StartServers           10
        MinSpareServers        10
        MaxSpareServers        20
        MaxClients            200
        MaxRequestsPerChild   300
    </IfModule>
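
    (For reference, whether MaxClients is actually being reached can be watched live through Apache's mod_status scoreboard; assuming mod_status is loaded, a minimal configuration for that on each backend, with the status page restricted to the LAN, would look something like the block below. The subnet and URL are assumptions, not our exact config.)

        ExtendedStatus On
        <Location /server-status>
            SetHandler server-status
            Order deny,allow
            Deny from all
            # assumed: the subnet shared by the balancer and the backends
            Allow from 192.168.1.0/24
        </Location>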
    

    The nginx server and the Apache servers are all located in the same datacenter, on the same subnet and VLAN, and I see nothing in the error_log on the Apache side that indicates a reason for the timeouts.

    Additional things we've tried while troubleshooting this include:

    • Upping the proxy_read_timeout to 300.
    • Removing the Gzip settings.
    • Removing the location block for css, images and javascript caching.
    • Enabling proxy_buffering. It's normally disabled because of the large report, so that nginx starts serving the output as it is rendered (including a "building report" JavaScript progress indicator) rather than showing a blank page for 20-120 seconds.
    • Adding keepalive 8 / 16 / 32 / 64 to the upstream block (see the sketch after this list).
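
    (For reference, the nginx documentation notes that for upstream keepalive connections to be used, the proxied requests need HTTP/1.1 and a cleared Connection header, so those attempts would have needed something along these lines; 32 is just one of the values tried:)

        upstream farm {
           ip_hash;
           keepalive 32;

           server www1.domain.com:8080;
           server www2.domain.com:8080 down;
           server www3.domain.com:8080;
           server www4.domain.com:8080;
        }

        location / {
            proxy_http_version 1.1;
            proxy_set_header   Connection "";
            proxy_pass         http://farm;
        }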

    At this point I'm dubious that it's a network problem or a problem with the backend, as I've moved the webapp back to the Coyote Point load balancer and the complaints have dropped off.

    I'd really love to figure this out but I'm kind of stumped on where to go from here. Advice please?

    • jeffreyveon
      jeffreyveon almost 9 years
      Facing this exact issue - were you able to resolve this?
    • Jchieppa
      Jchieppa almost 9 years
      @jeffreyveon I'm afraid I never figured out the root cause of this. I ended up sticking HAProxy between nginx and the backend servers, so the path went Client <-> nginx <-> HAProxy <-> backend. It had the added benefit of actively polling the backend nodes with health checks, which in my opinion is much better than discovering a failure on a real request. A rough sketch of the HAProxy side is below.
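
      (This is an illustrative, assumed sketch rather than the actual production config; the server names mirror the nginx upstream and the check options are just the basics for active HTTP health checks.)

          # assumed minimal HAProxy proxy with active health checks
          listen farm
              bind 127.0.0.1:8081            # nginx would proxy_pass to this
              mode http
              balance source                 # roughly equivalent to nginx ip_hash
              option httpchk GET /
              timeout connect 10s
              timeout client  180s
              timeout server  180s
              server www1 www1.domain.com:8080 check
              server www3 www3.domain.com:8080 check
              server www4 www4.domain.com:8080 check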