Improving query performance by example-CodePudding

I'm trying to think out a way to improve a query the consumed schema is like this:

CREATE TABLE `orders` (
  `id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
  `store_id` INTEGER NOT NULL,
  `billing_profile_id` INTEGER NOT NULL,
  `billing_address_id` INTEGER NULL,
  `total` DECIMAL(8, 2) NOT NULL
);

CREATE TABLE `billing_profiles` (
  `id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
  `name` TEXT NOT NULL
);

CREATE TABLE `billing_addresses` (
  `id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
  `address` TEXT NOT NULL
);

CREATE TABLE `stores` (
  `id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
  `name` TEXT NOT NULL
);

The query I'm executing:

SELECT bp.name,
       ba.address,
       s.name,
       Sum(o.total) AS total
FROM   billing_profiles bp,
       stores s,
       orders o
       LEFT JOIN billing_addresses ba
              ON o.billing_address_id = ba.id
WHERE  o.billing_profile_id = bp.id
       AND s.id = o.store_id
GROUP  BY bp.name,
          ba.address,
          s.name;

And here is the EXPLAIN:

 ---- ------------- ------- ------------ -------- --------------- --------- --------- ------------------------------ ------- ---------- -------------------------------------------- 
| id | select_type | table | partitions | type   | possible_keys | key     | key_len | ref                          | rows  | filtered | Extra                                      |
 ---- ------------- ------- ------------ -------- --------------- --------- --------- ------------------------------ ------- ---------- -------------------------------------------- 
|  1 | SIMPLE      | bp    | NULL       | ALL    | PRIMARY       | NULL    | NULL    | NULL                         |155000 |   100.00 | Using temporary                            |
|  1 | SIMPLE      | o     | NULL       | ALL    | NULL          | NULL    | NULL    | NULL                         |220000 |    33.33 | Using where; Using join buffer (hash join) |
|  1 | SIMPLE      | ba    | NULL       | eq_ref | PRIMARY       | PRIMARY | 4       | factory.o.billing_address_id |    1  |   100.00 | NULL                                       |
|  1 | SIMPLE      | s     | NULL       | eq_ref | PRIMARY       | PRIMARY | 4       | factory.o.store_id           |    1  |   100.00 | NULL                                       |
 ---- ------------- ------- ------------ -------- --------------- --------- --------- ------------------------------ ------ ---------- --------------------------------------------

The problem I'm facing is that this query takes 30 secs to excute, we have over 200000 orders, and 150000 billing_profiles/billing_addresses.

What should I do regarding index/constraints so that this query becomes faster to execute?

Edit: after some suggestions in the comments I edited the query to:

SELECT bp.name,
       ba.address,
       s.name,
       Sum(o.total) AS total
FROM   orders o
       INNER JOIN billing_profiles bp
               ON o.billing_profile_id = bp.id
       INNER JOIN stores s
               ON s.id = o.store_id
       LEFT JOIN billing_addresses ba
              ON o.billing_address_id = ba.id
GROUP  BY bp.name,
          ba.address,
          s.name;

But still takes too much time.

CodePudding user response：

One thing I have used in the past and has helped in many instances with MySQL is to use the STRAIGHT_JOIN clause which tells the engine to do the query in the order as listed.

I have cleaned-up your query to proper JOIN context. Since the ORDER table is the primary basis of data, and the other 3 are lookup references to their respective IDs, I put the ORDER table first.

SELECT STRAIGHT_JOIN
        bp.name,
        ba.address,
        s.name,
        Sum(o.total) AS total
    FROM
        orders o
            JOIN stores s
                ON o.store_id = s.id
            JOIN billing_profiles bp
                on o.billing_profile_id = bp.id
            LEFT JOIN billing_addresses ba
                ON o.billing_address_id = ba.id
    GROUP BY 
        bp.name,
        ba.address,
        s.name

Now, your data tables dont appear that large, but if you are going to be grouping by 3 of the columns in the order table, I would have an index on the underlying basis of them, which are the "ID" keys linking to the other tables. Adding the total to help for a covering index / aggregate query, I would index on

( store_id, billing_profile_id, billing_address_id, total )

I'm sure that in reality, you have many other columns associated with an order and just showing the context for this query. Then, I would change to a pre-query so the aggregation is all done once for the orders table by their ID keys, THEN the result is joined to the lookup tables and you just need to apply an ORDER BY clause for your final output. Something like..

SELECT
        bp.name,
        ba.address,
        s.name,
        o.total
    FROM
        ( select
                store_id,
                billing_profile_id,
                billing_address_id,
                sum( total ) total 
            from
                orders 
            group by
                store_id,
                billing_profile_id,
                billing_address_id ) o
    JOIN stores s
        ON o.store_id = s.id
    JOIN billing_profiles bp
        on o.billing_profile_id = bp.id
    LEFT JOIN billing_addresses ba
        ON o.billing_address_id = ba.id
    ORDER BY 
        bp.name,
        ba.address,
        s.name

CodePudding user response：

Add this index to o, being sure to start with billing_profile_id:

INDEX(billing_profile_id, store_id, billing_address_id, total)

Discussion of the Explain:

The Optimizer saw that it needed to do a full scan of some table.
bp was smaller than o, so it picked bp as the "first" table.
Then it reached into the next table repeatedly.
It did not see a suitable index (one starting with billing_profile_id) and decided to do "Using join buffer (hash join)", which involves loading the entire table into a hash in RAM.
"Using temporary", though mentioned on the "first" table, really does not show up until just before the GROUP BY. (The GROUP BY references multiple tables, so there is no way to optimize it.)

Potential bug Please check the results of Sum(o.total) AS total. It is performed after all the JOINing and before the GROUP BY, so it may be inflated. Notice how DRapp's formulation does the SUM before the JOINs.