"SELECT COUNT(*)" is slow, even with where clause


Solution 1

InnoDB uses clustered primary keys, so the primary key is stored along with the row in the data pages, not in separate index pages. In order to do a range scan you still have to scan through all of the potentially wide rows in data pages; note that this table contains a TEXT column.

Two things I would try:

  1. Run OPTIMIZE TABLE. This ensures that the data pages are physically stored in sorted order, which could conceivably speed up a range scan on a clustered primary key.
  2. Create an additional non-primary index on just the change_event_id column. This will store a copy of that column in index pages, which will be much faster to scan. After creating it, check the explain plan to make sure it's using the new index (both steps are sketched below).

(you also probably want to make the change_event_id column bigint unsigned if it's incrementing from zero)
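A minimal sketch of both steps against the table from the question (the secondary index name is illustrative, not from the original schema):

    -- Rebuild the table so the clustered data pages are stored in sorted order.
    OPTIMIZE TABLE change_event;

    -- Redundant secondary index on just the primary-key column; its narrow
    -- index pages can be scanned without striding over the wide TEXT column.
    -- (idx_change_event_id is an illustrative name.)
    CREATE INDEX idx_change_event_id ON change_event (change_event_id);

    -- Confirm the optimizer now picks the new, narrower index.
    EXPLAIN SELECT COUNT(*) FROM change_event
    WHERE change_event_id > '1212281603783391';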

Solution 2

Here are a few things I suggest:

  • Change the column from a "bigint" to an "int unsigned". Do you really ever expect to have more than 4.2 billion records in this table? If not, then you're wasting space (and time) with the extra-wide field. MySQL indexes are more efficient on smaller data types.

  • Run the "OPTIMIZE TABLE" command, and see whether your query is any faster afterward.

  • You might also consider partitioning your table according to the ID field, especially if older records (with lower ID values) become less relevant over time. A partitioned table can often execute aggregate queries faster than one huge, unpartitioned table (a sketch follows this list).
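A sketch of what that might look like, with caveats: RANGE partitioning needs MySQL 5.1 or later (the question is on 5.0), partitioned tables can't carry foreign keys, so fk_change_event_import has to go first, and the boundary values below are purely illustrative:

    -- Partitioned tables do not support foreign keys, so drop the
    -- constraint first (partitioning itself requires MySQL 5.1+).
    ALTER TABLE change_event DROP FOREIGN KEY fk_change_event_import;

    -- Boundary values are illustrative, not derived from the real data.
    ALTER TABLE change_event
    PARTITION BY RANGE (change_event_id) (
        PARTITION p0   VALUES LESS THAN (1000000000000000),
        PARTITION p1   VALUES LESS THAN (1200000000000000),
        PARTITION pmax VALUES LESS THAN MAXVALUE
    );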


EDIT:

Looking more closely at this table, it looks like a logging-style table, where rows are inserted but never modified.

If that's true, then you might not need all the transactional safety provided by the InnoDB storage engine, and you might be able to get away with switching to MyISAM, which is considerably more efficient on aggregate queries.
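If you can live with those trade-offs, the switch itself is nearly a one-liner; the only wrinkle is the foreign key, which MyISAM won't enforce and which has to be dropped first:

    -- MyISAM does not support foreign keys, so drop the constraint first.
    ALTER TABLE change_event DROP FOREIGN KEY fk_change_event_import;
    ALTER TABLE change_event ENGINE=MyISAM;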

Solution 3

I've run into behavior like this before with IP geolocation databases. Past some number of records, MySQL's ability to get any advantage from indexes for range-based queries apparently evaporates. With the geolocation DBs, we handled it by segmenting the data into chunks small enough that the indexes could still be used.

Solution 4

Check to see how fragmented your indexes are. At my company we have a nightly import process that trashes our indexes, and over time it can have a profound impact on data access speeds. For example, we had a SQL procedure that took 2 hours to run; after de-fragmenting the indexes, it took 3 minutes. We use SQL Server 2005; I'll look for a script that can check this on MySQL.

Update: Check out this link: http://dev.mysql.com/doc/refman/5.0/en/innodb-file-defragmenting.html
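I haven't found a direct MySQL equivalent of SQL Server's fragmentation views; one rough proxy (my assumption, not something from the linked page) is the Data_free column reported by SHOW TABLE STATUS:

    -- A large Data_free relative to Data_length suggests the table
    -- and its indexes carry a lot of dead space.
    SHOW TABLE STATUS LIKE 'change_event';

    -- Rebuilding the table reclaims that space.
    OPTIMIZE TABLE change_event;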

Solution 5

Run "analyze table_name" on that table - it's possible that the indices are no longer optimal.

You can often tell this by running "show index from table_name". If the cardinality value is NULL then you need to force re-analysis.
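For the table in the question, that looks like:

    -- Refresh the index statistics the optimizer relies on.
    ANALYZE TABLE change_event;

    -- If Cardinality shows NULL for any index, the statistics were stale.
    SHOW INDEX FROM change_event;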


Comments

  • JocelynH
    JocelynH over 4 years

    I'm trying to figure out how to optimize a very slow query in MySQL (I didn't design this):

    SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391';
    +----------+
    | COUNT(*) |
    +----------+
    |  3224022 |
    +----------+
    1 row in set (1 min 0.16 sec)
    

    Comparing that to a full count:

    select count(*) from change_event;
    +----------+
    | count(*) |
    +----------+
    |  6069102 |
    +----------+
    1 row in set (4.21 sec)
    

    The explain statement doesn't help me here:

     explain SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391'\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: me
             type: range
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 8
              ref: NULL
             rows: 4120213
            Extra: Using where; Using index
    1 row in set (0.00 sec)
    

    OK, it still thinks it needs roughly 4 million entries to count, but I could count lines in a file faster than that! I don't understand why MySQL is taking this long.

    Here's the table definition:

    CREATE TABLE `change_event` (
      `change_event_id` bigint(20) NOT NULL default '0',
      `timestamp` datetime NOT NULL,
      `change_type` enum('create','update','delete','noop') default NULL,
      `changed_object_type` enum('Brand','Broadcast','Episode','OnDemand') NOT NULL,
      `changed_object_id` varchar(255) default NULL,
      `changed_object_modified` datetime NOT NULL default '1000-01-01 00:00:00',
      `modified` datetime NOT NULL default '1000-01-01 00:00:00',
      `created` datetime NOT NULL default '1000-01-01 00:00:00',
      `pid` char(15) default NULL,
      `episode_pid` char(15) default NULL,
      `import_id` int(11) NOT NULL,
      `status` enum('success','failure') NOT NULL,
      `xml_diff` text,
      `node_digest` char(32) default NULL,
      PRIMARY KEY  (`change_event_id`),
      KEY `idx_change_events_changed_object_id` (`changed_object_id`),
      KEY `idx_change_events_episode_pid` (`episode_pid`),
      KEY `fk_import_id` (`import_id`),
      KEY `idx_change_event_timestamp_ce_id` (`timestamp`,`change_event_id`),
      KEY `idx_change_event_status` (`status`),
      CONSTRAINT `fk_change_event_import` FOREIGN KEY (`import_id`) REFERENCES `import` (`import_id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    

    Version:

    $ mysql --version
    mysql  Ver 14.12 Distrib 5.0.37, for pc-solaris2.8 (i386) using readline 5.0
    

    Is there something obvious I'm missing? (Yes, I've already tried "SELECT COUNT(change_event_id)", but there's no performance difference).

  • Random Developer
    Random Developer over 15 years
    Here is a link: dev.mysql.com/doc/refman/5.0/en/innodb-file-defragmenting.html. Best of luck with everything.
  • JocelynH
    JocelynH over 15 years
    "analyze table change_event" had no impact on performance. Thanks, though.
  • JocelynH
    JocelynH over 15 years
    What a nasty solution. Nonetheless, I brought it up earlier and barring some strange configuration fix or other solution, we might be forced to go this route :(
  • Alnitak
    Alnitak over 15 years
    did it make the plain "select count(*)" any faster? I've just tried on a 110M record MyISAM table. "select count(*)" was instant. Selecting the count for ~half the table took 2m48s the first time, and 27s the second time.
  • JocelynH
    JocelynH over 15 years
    Except that we need counts on ranges, so managing a count via triggers doesn't work (unless I've misunderstood you).
  • JocelynH
    JocelynH over 15 years
    MyISAM has radically different performance characteristics from InnoDB. That's because MyISAM does table-level locking and effectively only has one transaction at a time. InnoDB behaves much differently under the covers.
  • MiniQuark
    MiniQuark over 15 years
    You might want to put that link in your answer?
  • Rob Williams
    Rob Williams over 15 years
    This is a great solution that respects a basic principle of computer solutions: programming in-the-large is qualitatively different from programming in-the-small. In the case of databases, the access plans and the use of indexes change dramatically as size increases past certain thresholds.
  • JocelynH
    JocelynH over 15 years
    Given that we have numbers like "1212281603783397", I think that already overflows "int unsigned" (it's a high-res timestamp). "OPTIMIZE TABLE" had no performance impact :( Isn't MyISAM much slower with "where" clauses since it needs to do a table scan? Also, we'd lose our FK constraint.
  • Daniel Liu
    Daniel Liu over 15 years
    Why use a timestamp for your primary key, if you already have a timestamp field? Also, what happens if two events happen at the same instant? If I were you, I'd use a simple auto-increment field for the pkey.
  • Daniel Liu
    Daniel Liu over 15 years
    The WHERE clause doesn't necessarily cause a full table scan. For a simple query (equals, less-than, greater-than, etc) on an indexed column, the query optimizer uses the index to find relevant pages, and then only scans those pages. A FTS would be required if you were doing date-math or substrings.
  • ʞɔıu
    ʞɔıu over 15 years
    an auto-increment key might actually be suboptimal for a logging table in innodb as it requires a brief full table lock in order to acquire the next increment.
  • JocelynH
    JocelynH over 15 years
    The "optimize table" didn't help much, but the redundant index solved the problem. Thanks!
  • Daniel Liu
    Daniel Liu over 15 years
    Oh, and with respect to "losing the FK constraint", I wouldn't worry too much about it. You can still join against the 'import' table, using the same foreign key. You just can't ask MyISAM to enforce that constraint. Depending on your data, that might be a sacrifice you can live with.
  • JocelynH
    JocelynH over 15 years
    The hires-timestamp was needed because InnoDB does a full table lock to get the next key and that was a significant performance hit. It wasn't my decision, but it worked.
  • JocelynH
    JocelynH over 15 years
    benjismith: We really don't have much to simulate a proper load, unfortunately. Given my workload right now, I won't be able to return to this unless we have more issues (though I'd like to know this myself).
  • itsjavi
    itsjavi over 15 years
    It doesn't actually take a full table lock in the way you think. The table lock for an AUTO_INCREMENT insert is to end-of-statement, not end-of-transaction. * dev.mysql.com/doc/refman/5.1/en/…
  • Daniel Liu
    Daniel Liu over 15 years
    Glad you found a solution. I've always explicitly added an index to my pkey columns, so I did a brief double-take when I looked at your table definition, but I made the same assumption as you did that the pkey declaration would be sufficient. Anyhow, cheers!
  • Mark Amery
    Mark Amery almost 10 years
    This is the first time I've ever seen anyone suggest creating a redundant index on a PRIMARY KEY column as a performance hack in MySQL. I'm pretty interested in the details of why this works and the kinds of queries for which it is useful. Do you have any links to further reading on the topic?
  • shashi009
    shashi009 about 8 years
    I came across a similar problem with a geolocation database, and after various optimization attempts like indexing, partitioning, etc., I just gave dividing the large tables into smaller datasets a shot, which finally proved acceptable in terms of performance.
  • Rick James
    Rick James about 7 years
    OPTIMIZE TABLE is rarely of use, especially on InnoDB tables. Any improvement may be because you freshly loaded the entire table into cache.
  • Barry Kelly
    Barry Kelly about 6 years
    @MarkAmery MySQL innodb format stores all row data in the primary index; if you don't have a primary key, one is synthesized for use in the storage index. This means that rather than it being an index over bigints, it's an index over the whole data tuple, so it has to stride through, so it's not fast to scan.
  • Barry Kelly
    Barry Kelly about 6 years
    @MarkAmery for more details, see dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html - the primary key index is the clustered index for row storage - it's a subtle implication; blink and you miss it.
  • rupashka
    rupashka over 4 years
    I tried to add a redundant index for the primary key and it didn't help. The following simple query takes 30-50 milliseconds, but other queries are much faster: SELECT COUNT(*) FROM CL_USER WHERE pk <= 172114. The table is CREATE TABLE CL_USER (PK int unsigned NOT NULL AUTO_INCREMENT,...
  • Tofandel
    Tofandel about 3 years
    Switching to MyISAM changed everything for me: the count went from 4s to 0.01s, and the table storage size also dropped considerably. I didn't have any FK on this one, so it was the perfect solution in my case.
  • Tofandel
    Tofandel about 3 years
    For a geolocation database, you should definitely use MyISAM