Troubleshooting Site Slowness on a Nginx + Gunicorn + Django Stack


Solution 1

That's a lot of sites to host on a server with only 1GB of RAM. You're at nearly 100% memory utilization, and the numbers you have are probably "standby" numbers. The RAM usage of each process can and will balloon in the process of serving requests. Right off the bat, you need to add more RAM to this instance and, better, move some of the sites off onto another server.

As to your questions:

  1. Where'd you get the idea that sites become "inactive" and Gunicorn, then, has to load the site again? That's rubbish. As long as the Gunicorn process is running (i.e. not terminated manually or by an error on the site) it remains fully initialized and ready to go, whether it's been an hour or a month.

  2. You're hacking at the leaves here and leaving the root untouched. There's nothing out of the ordinary about the memory usage of each Gunicorn process; it needs RAM to run. Your problem is trying to run too much on a severely underpowered server. No optimization is going to save you here. You need more RAM or more servers, probably both.

  3. No need. Again, the problem is already identified. Pretty clearly in fact by the numbers you posted.

  4. There's no way to reliably know which processes are getting swapped. It changes every second, depending on which processes are actively running and need more RAM and which are idle or simply less active. When your server is this strapped for resources, it spends half its time just figuring out which process to juggle next, especially if they're all active and vying for resources.

  5. Yes. Gunicorn recommends 2*cores+1. So on a dual-core system, that's 5; on a quad-core, 9. However, there's no way you could run even 5 workers for each of these sites on this one system. You can't even run 1 worker for each reliably.
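Since a Gunicorn config file is plain Python, the 2*cores+1 rule can be computed at load time rather than hardcoded. A minimal sketch (the filename and socket path are illustrative, borrowed from the question's example config):

```python
# gunicorn_example.py -- sketch of a Gunicorn config applying the
# 2 * cores + 1 worker rule of thumb. Paths here are illustrative.
import multiprocessing

# 5 on a dual-core machine, 9 on a quad-core, always odd
workers = 2 * multiprocessing.cpu_count() + 1
bind = 'unix:/tmp/gunicorn_example.com.sock'
```

As noted above, though, this formula assumes the server has RAM to spare for every worker; on a 1GB box hosting 25 sites it is not achievable.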

  6. It depends on the "things". But when multiple sites are hosted on the same server, those servers are beasts spec-wise. On a small instance like yours, probably a VPS, and especially with only 1GB of RAM, one site is pretty much your limit. Two, maybe.

Solution 2

Regarding:

Regarding your answer to 5, I believe what Gunicorn recommends is overkill.

I recently performed some ad-hoc testing with the number of workers and found that, assuming you have enough RAM, the 2*cores+1 rule of thumb is pretty accurate. Requests/sec increased almost linearly until I got close to that number, then dropped off as the OS started to thrash.

Since results depend greatly on workload, try different values and see where your performance peaks.

Solution 3

1) Not sure what you mean by inactive? As in, disabled by nginx? Or just too slow to work?

2 and 3) django-debug-toolbar and django-debug-logging will be a good place to start. If this doesn't help, it's time to move to server-level profiling to see which processes are causing the problem.
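If the Django-level tools don't pinpoint the problem, the standard-library `tracemalloc` module can show which code paths allocate the most memory. A sketch (the list allocation is just a stand-in for a real view or ORM query):

```python
# Sketch: spotting memory-hungry code paths with the stdlib
# tracemalloc module, as a complement to django-debug-toolbar.
import tracemalloc

tracemalloc.start()

# Stand-in for a view or ORM query; here we just allocate a big list.
data = [str(i) * 10 for i in range(10000)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics('lineno')[:3]
for stat in top:
    print(stat)  # each line shows file:line and bytes allocated there
```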

4) Use top: How to find out which processes are swapping in linux?
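The same numbers `top` shows can be read programmatically from the `VmSwap` field of `/proc/<pid>/status` on Linux (kernel 2.6.34+). A parsing sketch; the sample text mimics the file's format:

```python
# Sketch: extracting per-process swap usage from /proc/<pid>/status
# (Linux-only; the VmSwap field appeared in kernel 2.6.34).
import re

def vmswap_kb(status_text):
    """Return VmSwap in kB from a /proc/<pid>/status dump, or 0."""
    m = re.search(r'^VmSwap:\s+(\d+)\s+kB', status_text, re.MULTILINE)
    return int(m.group(1)) if m else 0

sample = "Name:\tgunicorn\nVmRSS:\t   31000 kB\nVmSwap:\t   8192 kB\n"
print(vmswap_kb(sample))  # 8192
```

On a live box you would loop over `os.listdir('/proc')`, read each numeric directory's `status` file, and sort by the result.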

5) Yes - benchmarking. Pick a benchmarking tool (e.g. apachebench) and run tests against your current configuration. Tweak something. Run the tests again. Repeat until your performance problems are gone! For best results, use traffic which is similar to your live traffic (in terms of URL distribution, GET/POST, etc).
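The tweak-and-retest loop boils down to sweeping one setting and keeping the value with the best measured throughput. A sketch of that bookkeeping (the numbers are illustrative; in practice each would come from a run like `ab -n 1000 -c 10 http://example.com/`):

```python
# Sketch: pick the configuration with the highest measured requests/sec.
# Values are illustrative stand-ins for real apachebench results.
def best_setting(results):
    """results maps a setting (e.g. worker count) -> measured req/sec."""
    return max(results, key=results.get)

measured = {1: 210.0, 3: 540.0, 5: 660.0, 7: 480.0}  # req/sec per worker count
print(best_setting(measured))  # 5
```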

6) Yes, at both the nginx and app levels. You will probably get most benefit by profiling each site and improving its memory usage (see 2).

Author by

Brent O'Connor

Updated on June 22, 2022

Comments

  • Brent O'Connor
    Brent O'Connor almost 2 years

    Issue I Was Having

    I was having an issue where some sites were taking a long time to load (by "long time" I mean up to 16 seconds). Sometimes they would time out entirely, which generated an Nginx 504 error. Usually, when a site timed out, I could reload it and it would load quickly. The site I was having issues with gets very little traffic. I tested by loading the Django admin index page in order to rule out slowness caused by poor code. It should also be noted that this particular site only uses the Django admin, because it's an intranet-type site for staff only.

    Hosting Setup

    All the sites I'm hosting are on two Rackspace cloud servers. The first server is my app server, which has 1024 MB of RAM, and my second server is my database server, which has 2048 MB of RAM. The app server is serving up each site using Nginx, which serves all static files and proxies everything else to the Django Gunicorn workers for each site.

    Looking at the database server's RAM and CPU load, everything seems fine there.

    $ free -m
                 total       used       free     shared    buffers     cached
    Mem:          1999       1597        402          0        200       1007
    -/+ buffers/cache:        389       1610
    Swap:         4094          0       4094
    
    
    Top shows a CPU load average of: 0.00, 0.01, 0.05
    

    In order to try and troubleshoot what is happening, I wrote a quick little script which prints out the memory usage on the app server.
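A sketch of what such a script might look like: sum resident set size (RSS) by process name, as reported by `ps -eo rss,comm` style output. The parsing assumes one "rss_in_kb command" pair per line; a real script would also group Gunicorn workers by site.

```python
# Sketch: total RSS grouped by command name, parsed from lines in
# "rss_in_kb command" form (as produced by e.g. `ps -eo rss,comm`).
def rss_by_name(ps_lines):
    totals = {}
    for line in ps_lines:
        rss_kb, name = line.split(None, 1)
        totals[name] = totals.get(name, 0) + int(rss_kb)
    return totals

sample = ["31744 gunicorn", "19456 gunicorn", "8192 nginx"]
for name, kb in rss_by_name(sample).items():
    print(f"{name}: {kb // 1024} MB")
```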

    Example print out with the site domains anonymized:

    Celery:     23 MB
    Gunicorn:  566 MB
    Nginx:       8 MB
    Redis:     684 KB
    Other:      73 MB
    
                 total       used       free     shared    buffers     cached
    Mem:           993        906         87          0         19         62
    -/+ buffers/cache:        824        169
    Swap:         2047        828       1218
    
    Gunicorn memory usage by website:
    site01.example.com    31 MB
    site02.example.com    19 MB
    site03.example.com     7 MB
    site04.example.com     9 MB
    site05.example.com    47 MB
    site06.example.com    25 MB
    site07.example.com    14 MB
    site08.example.com    18 MB
    site09.example.com    27 MB
    site10.example.com    15 MB
    site11.example.com    14 MB
    site12.example.com     7 MB
    site13.example.com    18 MB
    site14.example.com    18 MB
    site15.example.com    10 MB
    site16.example.com    25 MB
    site17.example.com    13 MB
    site18.example.com    18 MB
    site19.example.com    37 MB
    site20.example.com    30 MB
    site21.example.com    23 MB
    site22.example.com    28 MB
    site23.example.com    80 MB
    site24.example.com    15 MB
    site25.example.com     5 MB
    

    Example Gunicorn config file:

    pidfile = '/var/run/gunicorn_example.com.pid'
    proc_name = 'example.com'
    workers = 1
    bind = 'unix:/tmp/gunicorn_example.com.sock'
    

    Example Nginx config:

    upstream example_app_server {
        server unix:/tmp/gunicorn_example.com.sock fail_timeout=0;
    }
    
    server {
    
        listen       80;
        server_name  example.com;
        access_log   /var/log/nginx/example.com.access.log;
        error_log    /var/log/nginx/example.com.error.log;
    
        location = /favicon.ico {
            return  404;
        }
    
        location  /static/ {
            root  /srv/sites/example/;
        }
    
        location  /media/ {
            root  /srv/sites/example/;
        }
    
        location  / {
            proxy_pass            http://example_app_server;
            proxy_redirect        off;
            proxy_set_header      Host             $host;
            proxy_set_header      X-Real-IP        $remote_addr;
            proxy_set_header      X-Forwarded-For  $proxy_add_x_forwarded_for;
            client_max_body_size  10m;
        }
    
    }
    

    As you can see, a lot of memory is swapped, so to fix my issues I upgraded the RAM on the app server, which eliminated the slowness entirely. Even though I was able to fix the issue, it took me a lot longer than I would have liked, and I still feel like I was basically guessing at what was causing the slowness. All of this leads me to my questions...

    Questions

    1. How can you tell whether slowness on a low-traffic site is caused by the site going "inactive", forcing Gunicorn to load it again on the next request? Is there a setting to prevent a site from going inactive?
    2. It seems like I have some sites that are taking too much memory. What are some tools I could use to reduce how much memory a site is using? Should I use some Python profiling tools?
    3. What are some tools and steps to take in order to determine at what level in the stack the bottleneck is occurring?
    4. What is the best way to determine if it's your Gunicorn processes that are getting swapped or if it's other processes that are getting swapped?
    5. Most of the sites I'm hosting don't get a ton of traffic so I'm using just one Gunicorn worker. Is there a more scientific way for determining and adjusting how many Gunicorn workers you have on a site?
    6. When hosting multiple sites on the same server, are there ways to configure things to use less memory?
  • Brent O'Connor
    Brent O'Connor about 12 years
    I'm guessing you missed this ... "Even though I was able to fix the issue, it took me a lot longer than I would like and I still feel like I was basically guessing at what was causing the site slowness. All this leads me to my questions..." Generally speaking, my questions were an attempt to get a listing of steps and tools that people use when troubleshooting slowness on a Django stack. Regarding your answer to 5, I believe what Gunicorn recommends is overkill. The sites I'm hosting are running fine with one worker, it depends on your traffic. I was looking for something more scientific.
  • Brent O'Connor
    Brent O'Connor about 12 years
    This is what I'm talking about though: unless your site gets a lot of traffic, this is overkill. I have one worker on all my sites and they have speedy response times. Going with what Gunicorn recommends would require 3 or 4 times the RAM. If your site doesn't have the traffic, then why waste the RAM? I just wish there was a more scientific way to adjust your workers (i.e. if you get X pageviews use X workers).
  • Brent O'Connor
    Brent O'Connor about 12 years
    1) Slow to respond. In my case this was because the worker was getting swapped.
  • Brent O'Connor
    Brent O'Connor about 12 years
    2) I do that, but that's not practical on a live server?
  • glarrain
    glarrain about 11 years
    @Brent "Why waste the RAM" doesn't make much sense. If you have a server running, get the most performance you can out of it (obviously with safe boundaries)!
  • ron rothman
    ron rothman about 7 years
    @Brent if you get X pageviews use X workers is the opposite of scientific. Science involves experimenting/measuring, not simplified rules of thumb.
  • EralpB
    EralpB over 6 years
    @glarrain yeah, NOT using the RAM is wasting the RAM.