I'm trying to get the total count of each GROUP where two words are used on a page together. For example, I have a word that appears 83 times on is own and another word that appears 22 times on its own, but within the same group (GROUP BY Book, Chapter, Page) they appear together 2 times, so I expect to get 2 from my query::
"SELECT SUM(pagetotal) AS total
FROM
(SELECT COUNT(DISTINCT Page) AS pagetotal
FROM Books
WHERE Word IN ('Apples', 'Oranges')
GROUP BY BookID, Chapter, Page
) AS t ";
This gives me a SUM of 103, which is clearly the opposite of what I want (it seems to be adding the individual usages of the word "Apples" (83) with that of "Oranges" (22), then SUBTRACTING the two pages where they're used together!) I've tried it with other entries in the column together and get the same result. What am I missing?
CodePudding user response:
GROUP BY BookID, Chapter, Page gives you one result row per book page. COUNT(DISTINCT Page) tells you how many pages are in the page, which is one, the page itself, obviously. You then add up all those ones to get the total number of pages one of the words exists in.
What you seem to want instead is the pages where both words appear in, for which you count them individually per page:
SELECT
BookID, Chapter, Page,
COUNT(CASE WHEN Word = 'Apples' THEN 1 END) as apples,
COUNT(CASE WHEN Word = 'Oranges' THEN 1 END) as oranges,
COUNT(*) AS total
FROM Books
WHERE Word IN ('Apples', 'Oranges')
GROUP BY BookID, Chapter, Page
HAVING COUNT(CASE WHEN Word = 'Apples' THEN 1 END) > 0
AND COUNT(CASE WHEN Word = 'Oranges' THEN 1 END) > 0
ORDER BY BookID, Chapter, Page;
If you just want the number of pages where both words occur together:
SELECT COUNT(*)
FROM
(
SELECT BookID, Chapter, Page,
FROM Books
WHERE Word IN ('Apples', 'Oranges')
GROUP BY BookID, Chapter, Page
HAVING COUNT(CASE WHEN Word = 'Apples' THEN 1 END) > 0
AND COUNT(CASE WHEN Word = 'Oranges' THEN 1 END) > 0
) pages_with_both_words;
The message of using CASE WHEN inside an aggreagtion function, as in COUNT(CASE WHEN Word = 'Oranges' THEN 1 END) is called conditional aggregation by the way. The standard SQL for this is:
COUNT(*) FILTER (WHERE Word = 'Oranges')
but only few DBMS support this, yet.
