Why do MySQL queries pile up in "Sending data" state?
Solution 1
This turned out to be a flaw in the combination of innodb_file_per_table, default-storage-engine = innodb, and a frequently-accessed page that created a temporary table. Each time a connection closed, it would drop the table, discarding pages from the buffer pool LRU. This would cause the server to stall for a bit, but never on the query that was actually causing the problem.
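For reference, the combination looked roughly like this in my.cnf (a sketch from memory, not the exact file):

```ini
# Hypothetical my.cnf fragment illustrating the combination described above
[mysqld]
default-storage-engine = innodb
# With file-per-table, each InnoDB table lives in its own .ibd file, so
# dropping a (temporary) table also evicts its pages from the buffer pool LRU
innodb_file_per_table
```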
Worse, the innodb_file_per_table setting had been languishing in our my.cnf file for months before the server had to be restarted for an entirely unrelated reason, during which time we had been using those temporary tables without a problem. (The NOC suddenly took down the DNS server, causing every new connection to hang because we hadn't enabled skip-name-resolve, and wouldn't admit for hours that anything had changed.)
Fortunately, we were able to rewrite the offending page to use an even faster set of queries that loaded most of the work onto the front-end web servers, and haven't seen a problem since.
Solution 2
Well, do note that, if I recall correctly (it's been a while since I did DB work), COUNT(*) queries without a WHERE clause are notoriously slower on InnoDB tables than on MyISAM and Memory tables.
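The reason is that InnoDB keeps no exact row counter the way MyISAM does, so an unfiltered COUNT(*) has to scan an index. If an approximate count is acceptable, the table statistics are much cheaper. A sketch (the table name is hypothetical, and these need a running server):

```sql
-- Exact, but scans an entire index on InnoDB:
SELECT COUNT(*) FROM orders;

-- The Rows column here is only an estimate for InnoDB,
-- but it is essentially free to read:
SHOW TABLE STATUS LIKE 'orders';
```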
Also, is this by any chance a Xen DomU?
What is the frontend language? If PHP, is it using the mysql or the mysqli extension? Are they using persistent connections?
You have not mentioned the underlying operating system, but in the case of Linux I would start by staring at the output of free -m, paying special attention to the last two lines to see if memory is tight overall.
[0:504] callisto:cyanotype $ free -m
             total       used       free     shared    buffers     cached
Mem:          3961       3816        144          0        184       1454
-/+ buffers/cache:       2177       1784
Swap:         2898          0       2898
Here we have a system that's healthy (it is my workstation). The "-/+ buffers/cache" line excludes buffers and cache, so I am in fact using 2177 MB of memory and have 1784 MB readily available.
The last line shows that I'm not using swap at all so far.
Then giving vmstat(8) a whirl, to see if your system is thrashing like mad, would be useful too.
[0:505] callisto:cyanotype $ vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo    in    cs us sy id wa
 0  0      0 134116 189828 1499948    0    0    11     3    44    49  1  1 98  0
 0  0      0 143112 189836 1489688    0    0     0     6   526  2177  1  1 98  0
 0  0      0 139268 190504 1491864    0    0   512     4   663  4704  2  1 96  1
 2  0      0 136688 191084 1493484    0    0   473     5   641  3039  1  1 97  1
 0  0      0  52636 191712 1518620    0    0  5066     4  1321  6600  8  2 86  4
 5  0      0  72992 193264 1377324    0    0 10742    31  1602  7441 12  3 80  5
 2  1      0  84036 193896 1202012    0    0 10126    43  2621  4305 31  2 57 10
 3  0      0  42456 195812 1060904    0    0  3970    75 55327  9806 43  5 41 10
 8  1      0  34620 197040  942940    0    0  3554    64 50892 12531 43  6 44  6
^C
[0:506] callisto:cyanotype $
(My desktop really isn't doing all that much here, sorry. What a waste of 8 perfectly good cores.)
If you see a lot of processes spending time in the 'b' column, that means they are blocked, waiting for something; often that is IO. The important columns here are si and so. Check whether they contain high values; if so, this may be your problem -- something is consuming a lot of memory, more than you can actually afford. Using top(1) and sorting by memory usage (shift+m while in top) might show the culprit(s).
It's not impossible that your system is thrashing to and from swap and saturating the disks, causing blocked threads and processes. The tool iostat(1) (usually part of the sysstat package) should be given a whirl to see if you have processes that are blocked, stuck in IO wait. A saturated disk can spell bad news for the whole system under high load, especially if the system is swapping a lot.
You might run iostat with extended stats, every five seconds, for instance:
[0:508] callisto:cyanotype $ iostat -x 5
Linux 2.6.35-23-generic (callisto) 2010-11-30 _x86_64_ (8 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16,55    0,12    2,70    2,60    0,00   78,02

Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdc        0,00    2,00   1,00   0,80    27,20    22,40    27,56     0,01   3,33   3,33   0,60
sdd        0,00   12,60  67,60   4,80  4222,40   139,20    60,24     0,62   8,62   3,29  23,80
sde        0,00    0,00   0,00   0,00     0,00     0,00     0,00     0,00   0,00   0,00   0,00
sdf        0,00    0,00   0,00   0,00     0,00     0,00     0,00     0,00   0,00   0,00   0,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32,02    0,10    1,83    0,44    0,00   65,61

Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdc        0,60    3,20  11,00   0,80   265,60    32,00    25,22     0,05   3,90   2,88   3,40
sdd        0,00    8,20   0,00   3,00     0,00    89,60    29,87     0,02   8,00   7,33   2,20
sde        0,00    0,00   0,00   0,00     0,00     0,00     0,00     0,00   0,00   0,00   0,00
sdf        0,00    0,00   0,00   0,00     0,00     0,00     0,00     0,00   0,00   0,00   0,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          49,26    0,22    3,12    0,12    0,00   47,28

Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdc        6,20    3,00   7,40   3,80   208,00    54,40    23,43     0,09   7,86   2,50   2,80
sdd        0,00   15,20   0,20   4,00     1,60   152,00    36,57     0,03   6,67   6,19   2,60
sde        0,00    0,00   0,00   0,00     0,00     0,00     0,00     0,00   0,00   0,00   0,00
sdf        0,00    0,00   0,00   0,00     0,00     0,00     0,00     0,00   0,00   0,00   0,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16,00    0,54    1,05    1,07    0,00   81,35

Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdc        4,20    0,00  31,40   0,00  3204,80     0,00   102,06     0,17   4,90   2,68   8,40
sdd        0,00   28,20   0,20   2,60     1,60   246,40    88,57     0,02   7,14   7,14   2,00
sde        0,00    0,00   0,00   0,00     0,00     0,00     0,00     0,00   0,00   0,00   0,00
sdf        0,00    0,00   0,00   0,00     0,00     0,00     0,00     0,00   0,00   0,00   0,00
^C
This should allow you to easily see whether your volumes are being saturated. For instance, here you can see that my disks are terribly underutilized and that the system spends most of its CPU cycles idling. If that time instead showed up mostly in the %iowait column, you would have an IO bottleneck. You probably already know all this, but it's worth covering all the bases to make sure.
The idea is that your config file changed and you have no history of it (putting your config files under version control is a great idea for that very reason) -- and it is not impossible that the size of a buffer suddenly changed, making expensive queries like COUNT(*) without a WHERE clause suddenly start to gobble up resources.
Based on what you have learned from the previous use of the tools above, you should probably inspect the configuration file (being the only thing that changed, it is very likely the culprit) to see if the buffer values are sane for your average load.
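Putting the config under version control takes a minute and answers the next "what changed?" question by itself. A minimal sketch with git (the scratch directory and file contents are illustrative; in practice you would track /etc/my.cnf itself):

```shell
# Keep a history of my.cnf so future config changes are visible via git log/diff
mkdir -p /tmp/cnf-history
cd /tmp/cnf-history
git init -q
# Stand-in for copying the real /etc/my.cnf into the repository:
printf '[mysqld]\ninnodb_buffer_pool_size = 25G\n' > my.cnf
git add my.cnf
git -c user.name=ops -c user.email=ops@example.com commit -q -m "baseline my.cnf"
git log --oneline -- my.cnf
```

After that, any edit to the file followed by a commit gives you a diffable history.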
How large are the buffers, like the query_cache_size value, and especially the sort_buffer_size? (If a sort doesn't fit in that buffer, it will be performed on disk, at a massive cost, as I'm sure you can imagine.)
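Whether sorts are actually spilling to disk shows up in the server's status counters on reasonably recent MySQL versions; a steadily climbing Sort_merge_passes is the telltale sign. A sketch (needs a running server):

```sql
-- Number of merge passes the sort algorithm had to do on disk;
-- if this grows under load, sort_buffer_size is too small for your queries:
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';

-- The current per-connection sort buffer size, for comparison:
SHOW GLOBAL VARIABLES LIKE 'sort_buffer_size';
```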
How large is the innodb_buffer_pool_size?
How large is the table_cache, and most importantly, does that value fit within the system limits for file handles? (Both open-files-limit in [mysqld] and at the OS level.)
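The OS side of that comparison is quick to check on Linux (the MySQL-side open_files_limit has to fit inside these):

```shell
# Per-process file-descriptor limits for the current shell/user:
ulimit -Sn                    # soft limit
ulimit -Hn                    # hard limit

# System-wide limit on open file handles:
cat /proc/sys/fs/file-max
```

If the per-process hard limit is far below table_cache plus the connection count, mysqld will run out of descriptors long before it hits its configured caches.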
Also, I don't remember off the top of my head if this is still true, but I'm fairly certain that InnoDB actually locks the entire table whenever it has to insert into an auto-increment field. I googled, and could not find out whether that is still the case.
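If the server is MySQL 5.1 or later, this behaviour is configurable and can be inspected directly (a sketch; the variable does not exist on older versions):

```sql
-- 0 = traditional table-level AUTO-INC lock held for the whole statement,
-- 1 = "consecutive" (lighter, the 5.1 default), 2 = "interleaved" (lightest):
SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode';
```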
You could also use innotop(1) to see what's going on in more detail.
I hope this helps somehow or gives you a starting point :)
eswald

Updated on September 17, 2022

Comments
-
eswald over 1 year
We're using InnoDB tables as the back end of a web application, and everything was fine for about two years until we had to restart MySQL a few weeks ago. (We hadn't disabled reverse DNS lookups, even though we weren't really using them, but our hosting system suddenly stopped responding to those requests. They're now disabled.) Unfortunately, the configuration file had changed, and we don't have a copy of its original state for comparison.
After fixing the most significant problems, we're left with a real puzzler: Under high load, the database queries start taking much longer than usual. During such times, we have several hundred open connections from our seven apache servers. Running SHOW PROCESSLIST reveals that half or more of those connections are in the "Sending data" state, frequently with times of a few hundred seconds. Almost all of their queries are SELECT, with similar queries tending to clump together. In fact, the lowest clump in the list has tended to be the exact same query (I would expect it to be in the query cache), returning 1104 rows of two integers each. Other frequent offenders are lists of a few hundred single-integer rows, several single-integer rows, or even a single COUNT(*) result.
We tried shutting down the web servers during one of these periods, but the problem returned within a minute of restarting them. However, completely restarting mysqld resolved the problem until the next day. What could the problem be, and how can we verify and/or fix it?
-
Khaled over 13 years
The problem can be related to the web servers. They could be establishing more and more connections without getting the query results back, then disconnecting or re-using the existing connection. Checking the web servers' performance would be a good place to start.
-
eswald over 13 years
I would have thought so, too, until we discovered that restarting the database server fixed the problem. However, the "Sending data" state could very well be the result of the web servers giving up without disconnecting; in that case, it's a symptom of a much deeper issue. Where should I look?
-
eswald over 13 years
Sorry; I forgot to specify that the COUNT(*) queries were using a WHERE. The purpose was to point out how little information would actually be sent.
-
eswald over 13 years
I don't believe any hypervisor is in use, but it's not impossible. If so, I have no access to it, though the system does seem to have direct access to the hardware.
-
eswald over 13 years
Operating system: Red Hat Enterprise Linux ES release 4 (Nahant Update 5)
-
eswald over 13 years
The swap space is almost unused at the moment, with (only!) about 40% of the physical memory used by MySQL. I wish I had better stats on that for the slow time, but it didn't look like swap was thrashing. Thank you for the command hints to be certain, though.
-
eswald over 13 years
query_cache_size: 67108864 (64M)
sort_buffer_size: 524280 (512K)
innodb_buffer_pool_size: 26843545600 (25G)
table_cache: 10240
open_files_limit: 22990
sysctl fs.file-max: 3150937
ulimit -Hn: 1024
ulimit -Sn: 1024
-
eswald over 13 years
InnoDB seems to be using an auto-increment lock for inserting new rows, which I have seen as an infrequent deadlock in SHOW INNODB STATUS.
-
eswald over 13 years
I so wish the NOC would let us install innotop...
-
Alex over 13 years
mr_daemon, I'm intrigued by your asking if this is a Xen domU, as I have a very similar problem with a machine which is on a Xen domU... any pointers? Thanks!
-
mr_daemon over 13 years
@jimbo I asked because I have often seen Xen Dom0s with saturated disk IO, including my own, which is a bit overloaded, or where there are problems with the disk backend (iSCSI, cache overflows, many VMs on the same controller, and so on), and this kind of thing doesn't show up at all from the DomU -- so you end up looking at the stats and going "Well, everything looks fine here, WHY IS THIS SLOW" and then find the next DomU, sharing the same controller, maxing out the TPS on that volume. It's just a bit harder to diagnose, but the same advice applies -- only on the Dom0, and with the Xen tools :)