I have the following table (tbl)
id | text
1 | example text
2 | this is an example text
3 | text text text
and I would like to return this table as output (unigrams)
ngram | counts | n_ids
text | 5 | 3
example | 2 | 2
this | 1 | 1
is | 1 | 1
an | 1 | 1
And I thought of solving this using a cross join (I'm on Presto).
WITH
ngram_array AS (
SELECT id, ngrams(split(text, ' ')) ngram_array FROM tbl
),
SELECT
array_join(ngram, ' ') ngram,
count(*) as counts,
count(id) as n_ids
FROM ngram_array CROSS JOIN UNNEST (ngram_array) AS t(ngram)
GROUP BY ngram
This seems to be giving me ngrams but the columns counts and n_ids have the same value while I expect a difference because once is the count of the ngram over the entire sample and the second one is the number of documents each ngram is present.
Do you know what I might be doing wrong and is there a fiddle where I can test this online (I know fiddles for Postgres but can't find one for Presto).
CodePudding user response:
You can split text as needed into string array,unnest it and use distinct option for count for ids in group by:
-- sample data
WITH dataset (id, text) AS (
VALUES (1, 'example text'),
(2, 'this is an example text'),
(3, 'text text text')
)
--query
SELECT word,
count(*) counts,
count(distinct id) n_ids -- count distinct ids
FROM (
SELECT id,
word
FROM dataset
CROSS JOIN UNNEST (split(text, ' ')) as t(word)
)
GROUP BY word
ORDER BY counts desc -- order for output
Output:
| word | counts | n_ids |
|---|---|---|
| text | 5 | 3 |
| example | 2 | 2 |
| this | 1 | 1 |
| is | 1 | 1 |
| an | 1 | 1 |
