Would it be faster to make a python script to aggregate a table?, or would a built in SQL aggregate-CodePudding

Currently, I have a little problem where I'm expected to build a table that shows the energy generated for the respected days.

I have solved this problem using python with SQL data polling combined with a for loop to look at the energy generated at the beginning of the day to the end of the day and the difference between the two will result in the total energy generated for the particular day. But unfortunately due to the amount of data that's coming out of the SQL database the python function is too slow.

I was wondering if this can be integrated within an SQL query to just spit out a table after it has done the aggregation. I have shown an example below for a better understanding of the table.

SQL TABLE

date/time	value
24/01/2022 2:00	2001
24/01/2022 4:00	2094
24/01/2022 14:00	3024
24/01/2022 17:00	4056
25/01/2022 2:00	4056
25/01/2022 4:00	4392
25/01/2022 17:00	5219

Final Table From the above table, we can work that the energy generated for 24/01/2022 is 4056(max)-2001(min)= 2055

date	value
24/01/2022	2055
25/01/2022	1163

CodePudding user response：

Yes, it will be faster to aggregate stuff before sending it over the network, group by is very fast:

https://stackoverflow.com/questions/4889669/whats-the-asymptotic-complexity-of-groupby-operation#:~:text=Ignoring the base SQL that,the size of the dataset).

CodePudding user response：

Usually, the time spent sending more stuff across the network makes the app-solution slower.

The GROUP BY may cost an extra sort, or it may be "free" if the data is sorted that way. (OK, you say unindexed.)

Show us the query and SHOW CREATE TABLE; we can help with indexing.

Generally, there is much less coding for the user if the work is done in SQL.

MySQL, in particular, picks between

Case 1: Sort the data O(N*log N), then make a linear pass through the data; this may or may not involve I/O which would add overhead

Case 2: Build a lookup table in RAM for collecting the grouped info, then making a linear pass over the data (no index needed); but then you need something like O(N*log n) for counting/summing/whatever the grouped value.

Notes:

I used N for the number or rows in the table and n for the number of rows in the output.
I do not know the conditions that would cause the Optimizer to pick one method versus the other.

If you drag all the data into the client, you would probably pick one of those algorithms. If you happen to know that you are grouping on a simple integer, the lookup (for the second algorithm) could be a simply array lookup -- O(N). But, as I say, the network cost is likely to kill the performance.

It is simple enough to write is SQL:

SELECT DATE(`date`)  AS "day",
       MAX(value) - MIN(value) AS range
    FROM tbl
    GROUP BY DATE(`date`);