In project we use MySQL table for analytics. It is big table 40 columns, more than 10kk rows. A big part of time in query take result calculation (50 cols in result). Idea is to reuse calculated values as variables and make it faster. Query example:
SELECT col1, SUM(col2) as s_col2, SUM(col3) as s_col3, AVG(col2) as a_col2, ...,
SUM(col2)/SUM(col3) as aaa,
ROUND(AVG(col4), 2) as a_col4,
ROUND(SUM(col5), 2) as s_col5,
ROUND(SUM(col5)/AVG(col4), 2) as zzz,
...
JOIN ...
GROUP_BY ...
ORDER BY ...
Idea is to use @variable, for example:
SELECT col1, @s_col2 := SUM(col2) as s_col2, @s_col3 := SUM(col3) as s_col3, ...,
@s_col2/@s_col3 as aaa,
It works only for a few variables which are outside function, but I don't need additional columns for every variable.
@a_col4 := AVG(col4), #I don`t need this column
@s_col5 := SUM(col5), #I don`t need this column
ROUND(@a_col4, 2) as a_col4,
ROUND(@s_col5, 2) as s_col5,
ROUND(@s_col5/@a_col4, 2) as zzz,
How I can assign variables inside functions?
ROUND(@a_col4 := AVG(col4), 2) as a_col4, #not works
ROUND((@s_col5 := SUM(col5)), 2) as s_col5, #not works
ROUND(@s_col5/@a_col4, 2) as zzz,
UPDATED: Thanks guys for your help.
The MySQL engine is probably smart enough to compute the value of SUM(col5) only once
I am not sure because for a big quantity of columns
SUM(col1) as a1,
SUM(col1) as a2,
SUM(col1) as a3,
SUM(col1) as a4,
SUM(col1) as a5,
Is slower than
@a1 := SUM(col1) as a1,
@a1 as a2,
@a1 as a3,
@a1 as a4,
@a1 as a5,
you can also use CTE for reusing table results
I tried, but some values use too many functions one inside another, sometimes 7 levels deep and not all variables can be reused in this way (COALESCE(ROUND(COALESCE(ROUND(SUM(AVG(IF..AND...OR...AND)...
All my changes, (15 variables) have very small effect, for the small period it takes 139 sec (was 151 sec), but some of our reports take a few hours and we need stronger optimisation.
We will try to analyse server bottlenecks, maybe use partitioning, sharding...
As a general rule, the number of rows touched is much more important to how long a query will take than the functions being evaluated.
The number of rows is always big, a lot of indexes and it works really fast. If I comment columns where we need calculations and only select existing it will take 40 sec (instead of 150)
CodePudding user response:
The qwy to do it in sql is to use the Select withe with the sum and average as basis for an outer select
SELECT *,
s_col2/s_col3 as aaa,
ROUND(a_col4, 2) as a_col4,
ROUND(acol5, 2) as s_col5,
ROUND(s_col5/a_col4, 2) as zzz.
...
FROM
(SELECT col1, SUM(col2) as s_col2, SUM(col3) as s_col3, AVG(col2) as a_col2, ...,
...
JOIN ...
GROUP_BY ...) t1
ORDER BY ...
in MySQL 8 you can also use CTE for reusing tbale rewsults see manual
CodePudding user response:
Don't worry.
As a general rule, the number of rows touched is much more important to how long a query will take than the functions being evaluated.
A similar question comes from the choice between these:
SELECT foo, COUNT(*) FROM x GROUP BY foo ORDER BY COUNT(*) DESC LIMIT 5;
SELECT foo, COUNT(*) FROM x GROUP BY foo ORDER BY 1 DESC LIMIT 5;
I often do the latter because it is fewer keystrokes. I have not been able to determine whether it is faster.
I suggest you write your question in the simplest or clearest way. A subquery (or CTE) may actually be clearer in spite of taking more keystrokes.
Clarity and correctness are more important than speed.
And beware -- With both JOIN and GROUP BY in the query, you may have incorrect results. The JOIN is done before the aggregation; the GROUP BY comes after. Check to see if COUNT or SUM is bigger than it should be. If so, you will need a subquery or CTE.
