Apache load balancer limits with Tomcat over AJP


Solution 1

Since your Apache error log shows that Apache cannot connect to Tomcat, it seems likely that the Tomcat application is what cannot keep up.

When I worked as a sysadmin for a large-ish Tomcat web site, I saw severe performance restrictions that weren't down to CPU. They came from synchronisation issues between threads or from delays in querying a back-end web service.

The latter was a huge problem because the popular Java HTTP client library (Apache Commons HttpClient, documented at the link below) limits the number of simultaneous connections to any one host to 2 by default (when I discovered this my jaw dropped). See http://hc.apache.org/httpclient-3.x/threading.html

Does your web app call any other web services?

Solution 2

It looks like Apache is getting a connection timeout when connecting to the servers in the pool, which leaves it unable to serve the request. Your timeout value looks VERY low: intermittent network latency, or even a page that takes a little extra time to generate, could cause a server to drop out of the pool. I would try higher timeout and retry values, and possibly a higher ping value.
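As a rough sketch of that suggestion, the balancer members from the question could be given more forgiving values. The numbers below (timeout=10, retry=30, ping=3) are illustrative assumptions, not tuned recommendations. The parameter names are the same ones the question's config already uses.

```apache
# Tomcat HA cluster with less aggressive failure detection.
# All values are assumptions for illustration; tune them against real latency.
#   timeout - seconds Apache waits on the backend connection before giving up
#   retry   - seconds a worker stays in error state before Apache tries it again
#   ping    - seconds to wait for the AJP CPONG reply to a CPING health check
<Proxy balancer://mycluster>
  BalancerMember ajp://10.176.201.9:8009 keepalive=On retry=30 timeout=10 ping=3
  BalancerMember ajp://10.176.201.10:8009 keepalive=On retry=30 timeout=10 ping=3
  BalancerMember ajp://10.176.219.168:8009 keepalive=On retry=30 timeout=10 ping=3
</Proxy>
```

With every limit set to one second, a brief stall that affects all three backends at once can put every worker into error state in the same second. That would match the "All workers are in error state" lines in the error log.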

You might also consider switching to the worker or event MPM; the prefork MPM generally has the worst performance.
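For comparison with the prefork block in the question, a worker MPM section might look something like the following. This is a sketch only: it assumes Apache 2.2 directive names (the question's config uses MaxClients), and the values are placeholders chosen to keep the same overall limit of 200 concurrent requests.

```apache
# Hypothetical worker MPM settings; values are assumptions, not recommendations.
<IfModule mpm_worker_module>
  StartServers          2
  MinSpareThreads      25
  MaxSpareThreads      75
  # 25 threads per child process
  ThreadsPerChild      25
  # 8 processes x 25 threads = 200 concurrent requests
  MaxClients          200
  # 0 = never recycle child processes
  MaxRequestsPerChild   0
</IfModule>
```

Because each child process serves many requests on threads, the same concurrency needs far fewer processes than prefork, which may matter on a load balancer with about 512 MB of RAM.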

Dedicated proxy/balancer software, such as squid, might also be a good option.


Peter Sankauskas
Author by

Peter Sankauskas

Software Engineer at AdMobius

Updated on September 17, 2022

Comments

  • Peter Sankauskas
    Peter Sankauskas over 1 year

    I have Apache acting as a load balancer in front of 3 Tomcat servers. Occasionally, Apache returns 503 responses, which I would like to eliminate completely. None of the 4 servers is under significant load in terms of CPU, memory, or disk, so I am a little unsure what is reaching its limits or why. 503s are returned when all workers are in error state, whatever that means. Here are the details:

    Apache config:

    <IfModule mpm_prefork_module>
      StartServers           30
      MinSpareServers        30
      MaxSpareServers        60
      MaxClients            200
      MaxRequestsPerChild  1000
    </IfModule>
    
    ...
    
    <Proxy *>
      AddDefaultCharset Off
      Order deny,allow
      Allow from all
    </Proxy>
    
    # Tomcat HA cluster
    <Proxy balancer://mycluster>
      BalancerMember ajp://10.176.201.9:8009 keepalive=On retry=1 timeout=1 ping=1
      BalancerMember ajp://10.176.201.10:8009 keepalive=On retry=1 timeout=1 ping=1
      BalancerMember ajp://10.176.219.168:8009 keepalive=On retry=1 timeout=1 ping=1
    </Proxy>
    
    # Passes thru track. or api.
    ProxyPreserveHost On
    ProxyStatus On
    
    # Original tracker
    ProxyPass /m  balancer://mycluster/m
    ProxyPassReverse /m balancer://mycluster/m
    

    Tomcat config:

    <Server port="8005" shutdown="SHUTDOWN">
      <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" />
      <Listener className="org.apache.catalina.core.JasperListener" />
      <Listener className="org.apache.catalina.mbeans.ServerLifecycleListener" />
      <Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener" />
    
      <Service name="Catalina">
        <Connector port="8080" protocol="HTTP/1.1" 
                   connectionTimeout="20000" 
                   redirectPort="8443" />
    
        <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />
    
        <Engine name="Catalina" defaultHost="localhost">
          <Host name="localhost"  appBase="webapps"
              unpackWARs="true" autoDeploy="true"
              xmlValidation="false" xmlNamespaceAware="false">
          </Host>
        </Engine>
      </Service>
    </Server>
    

    Apache error log:

    [Mon Mar 22 18:39:47 2010] [error] (70007)The timeout specified has expired: proxy: AJP: attempt to connect to 10.176.201.10:8009 (10.176.201.10) failed
    [Mon Mar 22 18:39:47 2010] [error] ap_proxy_connect_backend disabling worker for (10.176.201.10)
    [Mon Mar 22 18:39:47 2010] [error] proxy: AJP: failed to make connection to backend: 10.176.201.10
    [Mon Mar 22 18:39:47 2010] [error] (70007)The timeout specified has expired: proxy: AJP: attempt to connect to 10.176.201.9:8009 (10.176.201.9) failed
    [Mon Mar 22 18:39:47 2010] [error] ap_proxy_connect_backend disabling worker for (10.176.201.9)
    [Mon Mar 22 18:39:47 2010] [error] proxy: AJP: failed to make connection to backend: 10.176.201.9
    [Mon Mar 22 18:39:47 2010] [error] (70007)The timeout specified has expired: proxy: AJP: attempt to connect to 10.176.219.168:8009 (10.176.219.168) failed
    [Mon Mar 22 18:39:47 2010] [error] ap_proxy_connect_backend disabling worker for (10.176.219.168)
    [Mon Mar 22 18:39:47 2010] [error] proxy: AJP: failed to make connection to backend: 10.176.219.168
    [Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
    [Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
    [Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
    [Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
    [Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
    [Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
    

    Load balancer top info:

    top - 23:44:11 up 210 days,  4:32,  1 user,  load average: 0.10, 0.11, 0.09
    Tasks: 135 total,   2 running, 133 sleeping,   0 stopped,   0 zombie
    Cpu(s):  0.1%us,  0.2%sy,  0.0%ni, 99.2%id,  0.1%wa,  0.0%hi,  0.1%si,  0.3%st
    Mem:    524508k total,   517132k used,     7376k free,     9124k buffers
    Swap:  1048568k total,      352k used,  1048216k free,   334720k cached
    

    Tomcat top info:

    top - 23:47:12 up 210 days,  3:07,  1 user,  load average: 0.02, 0.04, 0.00
    Tasks:  63 total,   1 running,  62 sleeping,   0 stopped,   0 zombie
    Cpu(s):  0.2%us,  0.0%sy,  0.0%ni, 99.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   2097372k total,  2080888k used,    16484k free,    21464k buffers
    Swap:  4194296k total,      380k used,  4193916k free,  1520912k cached
    

    Catalina.out does not have any error messages in it.

    According to Apache's server status, it seems to be maxing out at 143 requests/sec. I believe the servers can handle substantially more load than they currently do, so any hints about low default limits or other reasons why this setup would be maxing out would be greatly appreciated.

    • Admin
      Admin about 14 years
      Do you have a DB? What is the load on the DB? Are you monitoring the network traffic? What about network errors? Have you run a load test of your network, application, and database server? Have you run a load test of the application directly against Tomcat, bypassing Apache? What are the differences? Have you taken thread dumps and compared them to see what the threads are waiting on? Have you configured the Apache status page?
    • Admin
      Admin about 14 years
      No database is used. Network traffic goes up and down, but the max is 1Mb/sec in, 100k/sec out. We see no network errors. Apache status is on, with an average of 37 threads waiting, 3 sending reply, and 25 closing connection. I can run JMeter and pump another 300 req/s through the system without issue, but the strange part is that Apache server status doesn't show the extra JMeter traffic, so I wonder if the server-status is correct. It seems like it can handle more traffic, but randomly returns 503s when all worker threads are busy.
  • Peter Sankauskas
    Peter Sankauskas about 14 years
    Yes, the majority of the time the Tomcat servers are fine and do what they are supposed to. I am wondering if there is some Apache or Debian config limit that is being hit.
  • Peter Sankauskas
    Peter Sankauskas about 14 years
    Switching to the worker MPM is definitely an option, but I would like to figure out what the underlying cause of the 503s is.
  • Kalpesh Lakhani
    Kalpesh Lakhani about 14 years
    Looking at your post more carefully, it looks like Apache is getting a connection timeout when connecting to the servers in the pool, which leaves it unable to serve the request. Your timeout value looks VERY low: intermittent network latency, or even a page that takes a little extra time to generate, could cause a server to drop out of the pool. I would try higher timeout and retry values, and possibly a higher ping value. I've updated my answer to contain these suggestions.