I'm trying to think out a way to improve a query the consumed schema is like this:
CREATE TABLE `orders` (
`id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
`store_id` INTEGER NOT NULL,
`billing_profile_id` INTEGER NOT NULL,
`billing_address_id` INTEGER NULL,
`total` DECIMAL(8, 2) NOT NULL
);
CREATE TABLE `billing_profiles` (
`id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
`name` TEXT NOT NULL
);
CREATE TABLE `billing_addresses` (
`id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
`address` TEXT NOT NULL
);
CREATE TABLE `stores` (
`id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
`name` TEXT NOT NULL
);
The query I'm executing:
SELECT bp.name,
ba.address,
s.name,
Sum(o.total) AS total
FROM billing_profiles bp,
stores s,
orders o
LEFT JOIN billing_addresses ba
ON o.billing_address_id = ba.id
WHERE o.billing_profile_id = bp.id
AND s.id = o.store_id
GROUP BY bp.name,
ba.address,
s.name;
And here is the EXPLAIN:
---- ------------- ------- ------------ -------- --------------- --------- --------- ------------------------------ ------- ---------- --------------------------------------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- ------- ------------ -------- --------------- --------- --------- ------------------------------ ------- ---------- --------------------------------------------
| 1 | SIMPLE | bp | NULL | ALL | PRIMARY | NULL | NULL | NULL |155000 | 100.00 | Using temporary |
| 1 | SIMPLE | o | NULL | ALL | NULL | NULL | NULL | NULL |220000 | 33.33 | Using where; Using join buffer (hash join) |
| 1 | SIMPLE | ba | NULL | eq_ref | PRIMARY | PRIMARY | 4 | factory.o.billing_address_id | 1 | 100.00 | NULL |
| 1 | SIMPLE | s | NULL | eq_ref | PRIMARY | PRIMARY | 4 | factory.o.store_id | 1 | 100.00 | NULL |
---- ------------- ------- ------------ -------- --------------- --------- --------- ------------------------------ ------ ---------- --------------------------------------------
The problem I'm facing is that this query takes 30 secs to excute, we have over 200000 orders, and 150000 billing_profiles/billing_addresses.
What should I do regarding index/constraints so that this query becomes faster to execute?
Edit: after some suggestions in the comments I edited the query to:
SELECT bp.name,
ba.address,
s.name,
Sum(o.total) AS total
FROM orders o
INNER JOIN billing_profiles bp
ON o.billing_profile_id = bp.id
INNER JOIN stores s
ON s.id = o.store_id
LEFT JOIN billing_addresses ba
ON o.billing_address_id = ba.id
GROUP BY bp.name,
ba.address,
s.name;
But still takes too much time.
CodePudding user response:
One thing I have used in the past and has helped in many instances with MySQL is to use the STRAIGHT_JOIN clause which tells the engine to do the query in the order as listed.
I have cleaned-up your query to proper JOIN context. Since the ORDER table is the primary basis of data, and the other 3 are lookup references to their respective IDs, I put the ORDER table first.
SELECT STRAIGHT_JOIN
bp.name,
ba.address,
s.name,
Sum(o.total) AS total
FROM
orders o
JOIN stores s
ON o.store_id = s.id
JOIN billing_profiles bp
on o.billing_profile_id = bp.id
LEFT JOIN billing_addresses ba
ON o.billing_address_id = ba.id
GROUP BY
bp.name,
ba.address,
s.name
Now, your data tables dont appear that large, but if you are going to be grouping by 3 of the columns in the order table, I would have an index on the underlying basis of them, which are the "ID" keys linking to the other tables. Adding the total to help for a covering index / aggregate query, I would index on
( store_id, billing_profile_id, billing_address_id, total )
I'm sure that in reality, you have many other columns associated with an order and just showing the context for this query. Then, I would change to a pre-query so the aggregation is all done once for the orders table by their ID keys, THEN the result is joined to the lookup tables and you just need to apply an ORDER BY clause for your final output. Something like..
SELECT
bp.name,
ba.address,
s.name,
o.total
FROM
( select
store_id,
billing_profile_id,
billing_address_id,
sum( total ) total
from
orders
group by
store_id,
billing_profile_id,
billing_address_id ) o
JOIN stores s
ON o.store_id = s.id
JOIN billing_profiles bp
on o.billing_profile_id = bp.id
LEFT JOIN billing_addresses ba
ON o.billing_address_id = ba.id
ORDER BY
bp.name,
ba.address,
s.name
CodePudding user response:
Add this index to o, being sure to start with billing_profile_id:
INDEX(billing_profile_id, store_id, billing_address_id, total)
Discussion of the Explain:
- The Optimizer saw that it needed to do a full scan of some table.
bpwas smaller thano, so it pickedbpas the "first" table.- Then it reached into the next table repeatedly.
- It did not see a suitable index (one starting with
billing_profile_id) and decided to do "Using join buffer (hash join)", which involves loading the entire table into a hash in RAM. - "Using temporary", though mentioned on the "first" table, really does not show up until just before the
GROUP BY. (TheGROUP BYreferences multiple tables, so there is no way to optimize it.)
Potential bug Please check the results of Sum(o.total) AS total. It is performed after all the JOINing and before the GROUP BY, so it may be inflated. Notice how DRapp's formulation does the SUM before the JOINs.
