Why is my MySQL GROUP BY so slow?


The offending part is probably the GROUP BY DATE(transaction_utc). You also claim to have a covering index for this query, but I see none in a useful order: your 5-column index has all the columns used in the query, but not in the best order (which is: WHERE - GROUP BY - SELECT).

So the engine, finding no useful index, would have to evaluate this function for all 20M rows. Actually, it finds an index that starts with username (the idx_unique) and uses that, so it only has to evaluate the function for 1.2M rows. If you had a (transaction_utc) or a (username, transaction_utc) index, it would choose the most useful of the three.
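
For example, either of those simpler indexes could be added like this (a sketch; the index names idx_txn and idx_user_txn are made up):

ALTER TABLE sales ADD INDEX idx_txn (transaction_utc) ;
-- better for this query: equality on username first, then the date range
ALTER TABLE sales ADD INDEX idx_user_txn (username, transaction_utc) ;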

Can you afford to change the table structure by splitting the column into date and time parts? If you can, then an index on (username, country_id, transaction_date), or, transposing the two columns used for grouping, on (username, transaction_date, country_id), would be quite efficient.

A covering index on (username, country_id, transaction_date, sale_usd, commission_usd) would be even better.
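
For illustration, the split-column variant could look like this (a sketch, assuming you can afford the rewrite; the transaction_date column and the idx_covering name are made up):

-- add the date part, backfill it from the existing column, then index it
ALTER TABLE sales ADD COLUMN transaction_date DATE NULL ;
UPDATE sales SET transaction_date = DATE(transaction_utc) ;
ALTER TABLE sales ADD INDEX idx_covering
    (username, country_id, transaction_date, sale_usd, commission_usd) ;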


If you want to keep the current structure, try changing the order inside your 5-column index to:

(username, country_id, transaction_utc, sale_usd, commission_usd)

or to:

(username, transaction_utc, country_id, sale_usd, commission_usd)
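
Since MySQL/MariaDB cannot reorder the columns of an index in place, either change means dropping and recreating the index, e.g. for the first ordering (a sketch; transaction_utc is the existing index's name from your schema, idx_user_country_txn is made up):

ALTER TABLE sales
    DROP INDEX transaction_utc,
    ADD INDEX idx_user_country_txn
        (username, country_id, transaction_utc, sale_usd, commission_usd) ;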

Since you are using MariaDB, you can use the virtual columns feature without changing the existing columns:

Add a virtual (persistent) column and the appropriate index:

ALTER TABLE sales 
    ADD COLUMN transaction_date DATE
               AS (DATE(transaction_utc)) 
               PERSISTENT, 
    ADD INDEX special_IDX 
        (username, country_id, transaction_date, sale_usd, commission_usd) ;
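
With the persistent column and its index in place, the query can filter and group on transaction_date directly, along these lines (a sketch based on the original query; note that BETWEEN on a DATE column includes the whole upper-bound day, unlike the original datetime range):

SELECT lss.`country_id` AS CountryId
     , lss.`transaction_date` AS TransactionDate
     , c.`name` AS CountryName
     , COALESCE(SUM(lss.`sale_usd`), 0) AS SaleUSD
     , COALESCE(SUM(lss.`commission_usd`), 0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE lss.`username` = 'someuser'
  AND lss.`transaction_date` BETWEEN '2012-09-26' AND '2012-10-26'
GROUP BY lss.`country_id`, lss.`transaction_date` ;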

Comments

  • JesseP almost 2 years

    I am trying to query against a partitioned table (by month) approaching 20M rows. I need to group by DATE(transaction_utc) as well as country_id. If I turn off the GROUP BY and the aggregates, the query returns just over 40k rows, which isn't too many; however, adding the GROUP BY makes the query substantially slower, unless said GROUP BY is on the transaction_utc column alone, in which case it gets FAST.

    I've been trying to optimize the first query below by tweaking the query and/or the indexes, and got to the point below (about 2x as fast as initially); however, I'm still stuck with a 5s query for summarizing 45k rows, which seems way too much.

    For reference, this box is a brand new 24-logical-core, 64GB-RAM MariaDB 5.5.x server with way more InnoDB buffer pool available than index space, so there shouldn't be any RAM or CPU pressure.

    So, I'm looking for ideas on what is causing this slow down and suggestions on speeding it up. Any feedback would be greatly appreciated! :)

    Ok, onto the details...

    The following query (the one I actually need) takes approx 5 seconds (+/-), and returns less than 100 rows.

    SELECT lss.`country_id` AS CountryId
    , Date(lss.`transaction_utc`) AS TransactionDate
    , c.`name` AS CountryName,  lss.`country_id` AS CountryId
    , COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
    , COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD  
    FROM `sales` lss  
    JOIN `countries` c ON lss.`country_id` = c.`country_id`  
    WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )  GROUP BY lss.`country_id`, DATE(lss.`transaction_utc`)
    

    EXPLAIN SELECT for the same query is as follows. Notice that it's not using the transaction_utc key. Shouldn't it be using my covering index instead?

    id  select_type  table  type    possible_keys                          key         key_len  ref                     rows     Extra
    1   SIMPLE       lss    ref     idx_unique,transaction_utc,country_id  idx_unique  50       const                   1208802  Using where; Using temporary; Using filesort
    1   SIMPLE       c      eq_ref  PRIMARY                                PRIMARY     4        georiot.lss.country_id  1
    

    Now onto a couple of other options I've tried in an attempt to determine what's going on...

    The following query (changed group by) takes about 5 seconds (+/-), and returns only 3 rows:

    SELECT lss.`country_id` AS CountryId
    , DATE(lss.`transaction_utc`) AS TransactionDate
    , c.`name` AS CountryName,  lss.`country_id` AS CountryId
    , COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
    , COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD  
    FROM `sales` lss  
    JOIN `countries` c ON lss.`country_id` = c.`country_id`  
    WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )  GROUP BY lss.`country_id`
    

    The following query (removed group by) takes 4-5 seconds (+/-) and returns 1 row:

    SELECT lss.`country_id` AS CountryId
        , DATE(lss.`transaction_utc`) AS TransactionDate
        , c.`name` AS CountryName,  lss.`country_id` AS CountryId
        , COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
        , COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD  
        FROM `sales` lss  
        JOIN `countries` c ON lss.`country_id` = c.`country_id`  
        WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
    

    The following query takes .00X seconds (+/-) and returns ~45k rows. This shows me that, at most, we're only trying to group 45K rows into fewer than 100 groups (as in my initial query):

    SELECT lss.`country_id` AS CountryId
        , DATE(lss.`transaction_utc`) AS TransactionDate
        , c.`name` AS CountryName,  lss.`country_id` AS CountryId
        , COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
        , COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD  
        FROM `sales` lss  
        JOIN `countries` c ON lss.`country_id` = c.`country_id`  
        WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
    GROUP BY lss.`transaction_utc`
    

    TABLE SCHEMA:

    CREATE TABLE IF NOT EXISTS `sales` (
      `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
      `user_linkshare_account_id` int(11) unsigned NOT NULL,
      `username` varchar(16) NOT NULL,
      `country_id` int(4) unsigned NOT NULL,
      `order` varchar(16) NOT NULL,
      `raw_tracking_code` varchar(255) DEFAULT NULL,
      `transaction_utc` datetime NOT NULL,
      `processed_utc` datetime NOT NULL ,
      `sku` varchar(16) NOT NULL,
      `sale_original` decimal(10,4) NOT NULL,
      `sale_usd` decimal(10,4) NOT NULL,
      `quantity` int(11) NOT NULL,
      `commission_original` decimal(10,4) NOT NULL,
      `commission_usd` decimal(10,4) NOT NULL,
      `original_currency` char(3) NOT NULL,
      PRIMARY KEY (`id`,`transaction_utc`),
      UNIQUE KEY `idx_unique` (`username`,`order`,`processed_utc`,`sku`,`transaction_utc`),
      KEY `raw_tracking_code` (`raw_tracking_code`),
      KEY `idx_usd_amounts` (`sale_usd`,`commission_usd`),
      KEY `idx_countries` (`country_id`),
      KEY `transaction_utc` (`transaction_utc`,`username`,`country_id`,`sale_usd`,`commission_usd`)
    ) ENGINE=InnoDB  DEFAULT CHARSET=utf8
    /*!50100 PARTITION BY RANGE ( TO_DAYS(`transaction_utc`))
    (PARTITION pOLD VALUES LESS THAN (735112) ENGINE = InnoDB,
     PARTITION p201209 VALUES LESS THAN (735142) ENGINE = InnoDB,
     PARTITION p201210 VALUES LESS THAN (735173) ENGINE = InnoDB,
     PARTITION p201211 VALUES LESS THAN (735203) ENGINE = InnoDB,
     PARTITION p201212 VALUES LESS THAN (735234) ENGINE = InnoDB,
     PARTITION pMAX VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */ AUTO_INCREMENT=19696320 ;
    
  • JesseP over 11 years
    Yes, I thought about splitting the date/time, but figured there was still some other issue. I do have an index on those same fields in a different order; are you saying I should just modify the order of the fields in the index?
  • ypercubeᵀᴹ over 11 years
    Yes, the order of columns in the index matters a lot. The only other index that might be good (or even better) is if you transpose the two columns: (username, transaction_utc, country_id)
  • JesseP over 11 years
    Wouldn't changing the order in the query do the same? (I saw this didn't make a noticeable difference in perf.) WHERE ( lss.username = 'someuser' AND lss.transaction_utc BETWEEN '2012-09-26' AND '2012-10-26' ) Or would leaving the query as is and changing the index as you noted be better? I guess I'm unclear on the order of execution of each part of the query, as I thought it would need to be ("WHERE CLAUSE ITEMS", "GROUP BY ITEMS", "JOIN ITEMS", "RETURNED ITEMS").
  • ypercubeᵀᴹ over 11 years
    The order in the WHERE does not matter. WHERE a_condition AND b_condition and WHERE b_condition AND a_condition are equivalent and the optimizer knows that.
  • JesseP over 11 years
    Interesting. I thought the order mattered when choosing the index, but I suppose I need to read through the docs in more detail. The VIRTUAL columns feature is something I didn't know about, but I can definitely leverage it elsewhere (and probably here too). Thanks for that! I'm going to try your index reorder suggestion and see how that works.
  • JesseP over 11 years
    This helped a ton. Thanks for taking the time to explain why the existing index wasn't used (transaction_utc being checked as a range rather than an equality). I changed the index to have username as the first field, and it's much faster now. For anyone else trying to understand/fix a similar issue, please also read the chat discussion, as some of the explanations there helped me understand the reasoning.