Need to increase nginx throughput to an upstream unix socket -- linux kernel tuning?


Solution 1

It sounds like the bottleneck is the app powering the socket rather than Nginx itself. We see this a lot with PHP when used with sockets versus a TCP/IP connection. In our case, though, PHP bottlenecks much earlier than Nginx ever would.

Have you checked the connection-tracking and socket backlog limits in sysctl.conf, in particular the following? (A quick sketch of checking and raising them follows the list.)

  • net.core.somaxconn
  • net.core.netdev_max_backlog
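
A minimal sketch of checking and raising these two sysctls (the values shown are illustrative, not recommendations; note that net.core.somaxconn also sets the ceiling for the backlog any application passes to listen(), including Unicorn's):

# Check the current values
sysctl net.core.somaxconn net.core.netdev_max_backlog

# Raise them at runtime
sudo sysctl -w net.core.somaxconn=8192
sudo sysctl -w net.core.netdev_max_backlog=16384

# Persist across reboots by adding the same keys to /etc/sysctl.conf:
#   net.core.somaxconn = 8192
#   net.core.netdev_max_backlog = 16384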

Solution 2

You might try looking at unix_dgram_qlen (see the proc documentation), although raising it may compound the problem by letting more requests pile up in the queue. You'll have to watch the socket queues to tell (netstat -x ...); a sketch of how to do that follows.
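
A rough sketch of inspecting this limit and the socket queues (on current kernels the sysctl is exposed as net.unix.max_dgram_qlen and applies to datagram-style Unix sockets; the value shown is illustrative, and the socket path is the one from the question):

# Current queue limit for Unix datagram sockets
cat /proc/sys/net/unix/max_dgram_qlen

# Raise it at runtime
sudo sysctl -w net.unix.max_dgram_qlen=512

# List the Unix sockets while under load, as suggested above
netstat -x | grep app.sock

# ss also prints Recv-Q/Send-Q columns for Unix sockets
ss -xa | grep app.sock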

Solution 3

tl;dr

  1. Make sure the Unicorn backlog is large and use a Unix socket, which is faster than TCP: listen("/var/www/unicorn.sock", backlog: 1024). See the sketch after this list.
  2. Optimise NGINX performance settings, for example worker_connections 10000;
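
For reference, a minimal config/unicorn.rb along those lines might look like this (the socket path matches the tl;dr above; the worker count and timeout are illustrative):

# config/unicorn.rb
worker_processes 8                               # illustrative; tune to available CPU cores
listen "/var/www/unicorn.sock", backlog: 1024    # large backlog on a Unix socket
timeout 30
preload_app true

After changing the backlog it may be safest to fully restart Unicorn rather than reload it, so the socket is re-created with the new value.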

Discussion

We had the same problem: a Rails app served by Unicorn behind an NGINX reverse proxy.

We were getting lines like this in the NGINX error log:

2019/01/29 15:54:37 [error] 3999#3999: *846 connect() to unix:/../unicorn.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: xx.xx.xx.xx, request: "GET / HTTP/1.1"

Reading the other answers, we also figured that maybe Unicorn was to blame, so we increased its backlog, but this did not resolve the problem. Monitoring the server processes, it was obvious that Unicorn was not getting requests to work on, so NGINX appeared to be the bottleneck.

Searching for NGINX settings to tweak in nginx.conf, a performance-tuning article pointed out several settings that could affect how many parallel requests NGINX can process, especially:

user www-data;
worker_processes auto;
pid /run/nginx.pid;
worker_rlimit_nofile 400000; # important

events {
  worker_connections 10000; # important
  use epoll; # important
  multi_accept on; # important
}

http {
  sendfile on;
  tcp_nopush on;
  tcp_nodelay on;
  keepalive_timeout 65;
  types_hash_max_size 2048;
  keepalive_requests 100000; # important
  server_names_hash_bucket_size 256;
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
  ssl_prefer_server_ciphers on;
  access_log /var/log/nginx/access.log;
  error_log /var/log/nginx/error.log;
  gzip on;
  gzip_disable "msie6";
  include /etc/nginx/conf.d/*.conf;
  include /etc/nginx/sites-enabled/*;
}
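
After changing these settings, it is worth validating and reloading, and confirming that the file-descriptor limit actually took effect for the workers. A rough sketch (assuming a systemd-managed NGINX):

# Validate the configuration before applying it
sudo nginx -t

# Reload workers without dropping connections
sudo systemctl reload nginx

# Check the "Max open files" limit of one worker process
sudo cat /proc/$(pgrep -f 'nginx: worker' | head -n1)/limits | grep 'open files'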

Solution 4

I solved it by increasing the backlog number in config/unicorn.rb. I used to have a backlog of 64:

 listen "/path/tmp/sockets/manager_rails.sock", backlog: 64

and I was getting this error:

 2014/11/11 15:24:09 [error] 12113#0: *400 connect() to unix:/path/tmp/sockets/manager_rails.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 192.168.101.39, server: , request: "GET /welcome HTTP/1.0", upstream: "http://unix:/path/tmp/sockets/manager_rails.sock:/welcome", host: "192.168.101.93:3000"

Now, I have increased it to 1024 and I don't get the error:

 listen "/path/tmp/sockets/manager_rails.sock", backlog: 1024

Comments

  • Ben Lee, over 1 year ago

    I am running an nginx server that acts as a proxy to an upstream unix socket, like this:

    upstream app_server {
            server unix:/tmp/app.sock fail_timeout=0;
    }
    
    server {
            listen ###.###.###.###;
            server_name whatever.server;
            root /web/root;
    
            try_files $uri @app;
            location @app {
                    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                    proxy_set_header X-Forwarded-Proto $scheme;
                    proxy_set_header Host $http_host;
                    proxy_redirect off;
                    proxy_pass http://app_server;
            }
    }
    

    Some app server processes, in turn, pull requests off /tmp/app.sock as they become available. The particular app server in use here is Unicorn, but I don't think that's relevant to this question.

    The issue is that, past a certain amount of load, nginx just doesn't seem to get requests through the socket at a fast enough rate. It doesn't matter how many app server processes I set up.

    I'm getting a flood of these messages in the nginx error log:

    connect() to unix:/tmp/app.sock failed (11: Resource temporarily unavailable) while connecting to upstream
    

    Many requests result in status code 502, and those that don't take a long time to complete. The nginx write queue stat hovers around 1000.

    Anyway, I feel like I'm missing something obvious here, because this particular configuration of nginx and app server is pretty common, especially with Unicorn (it's the recommended method in fact). Are there any Linux kernel options that need to be set, or something in nginx? Any ideas about how to increase the throughput to the upstream socket? Something that I'm clearly doing wrong?

    Additional information on the environment:

    $ uname -a
    Linux servername 2.6.35-32-server #67-Ubuntu SMP Mon Mar 5 21:13:25 UTC 2012 x86_64 GNU/Linux
    
    $ ruby -v
    ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]
    
    $ unicorn -v
    unicorn v4.3.1
    
    $ nginx -V
    nginx version: nginx/1.2.1
    built by gcc 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
    TLS SNI support enabled
    

    Current kernel tweaks:

    net.core.rmem_default = 65536
    net.core.wmem_default = 65536
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    net.ipv4.tcp_mem = 16777216 16777216 16777216
    net.ipv4.tcp_window_scaling = 1
    net.ipv4.route.flush = 1
    net.ipv4.tcp_no_metrics_save = 1
    net.ipv4.tcp_moderate_rcvbuf = 1
    net.core.somaxconn = 8192
    net.netfilter.nf_conntrack_max = 524288
    

    Ulimit settings for the nginx user:

    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 20
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 16382
    max locked memory       (kbytes, -l) 64
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 65535
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 8192
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) unlimited
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited
    
    • Khaled, almost 12 years ago
      Did you check the output of ulimit, specifically the number of open files?
    • Ben Lee, almost 12 years ago
      @Khaled, ulimit -n says 65535.
  • jmw, almost 12 years ago
    Any progress with this?
  • Ben Lee, almost 12 years ago
    Thanks for the idea, but this didn't appear to make any difference.
  • Ben Lee, almost 12 years ago
    I figured out the problem. See the answer I posted. It actually was the app bottlenecking, not the socket, just as you posit. I had ruled this out earlier due to a mis-diagnosis, but it turns out the problem was throughput to another server. I figured this out just a couple of hours ago. I'm going to award you the bounty, since you pretty much nailed the source of the problem despite the mis-diagnosis I put in the question; however, I'm going to give the checkmark to my answer, because my answer describes the exact circumstances, so it might help someone in the future with a similar issue.
  • Ben Lessani, over 11 years ago
    Your issue isn't nginx; it is more than capable. But that's not to say you might not have a rogue setting. Sockets are particularly sensitive under high load when the limits aren't configured correctly. Can you try your app with TCP/IP instead?
  • Ben Lee, over 11 years ago
    Same problem, of an even worse magnitude, with TCP/IP (the write queue climbs even faster). I have nginx / unicorn / kernel all set up exactly the same (as far as I can tell) on a different machine, and that other machine is not exhibiting this problem. (I can switch DNS between the two machines to get live load testing, and I have DNS on a 60-second TTL.)
  • Ben Lee, over 11 years ago
    Throughput between each machine and a DB machine is the same now, and latency between the new machine and the DB machine is about 30% more than between the old machine and the DB. But 30% more than a tenth of a millisecond is not the problem.
  • Ben Lee, over 11 years ago
    Nope, ulimit settings on both machines are the same (specifically open files is 65535, everything else looks fine too).
  • Ben Lee, over 11 years ago
    I added my ulimit settings to the end of the question.
  • tarkeshwar, about 11 years ago
    @BenLee Did you figure this one out? I may be facing a similar problem.
  • Ben Lee, about 11 years ago
    @tarkeshwar, no, never figured it out. Eventually I ended up going with different hardware and a somewhat different server stack instead of solving the problem.