unigrams in SQL-CodePudding

I have the following table (tbl)

id | text
1  | example text
2  | this is an example text
3  | text text text

and I would like to return this table as output (unigrams)

ngram   | counts | n_ids
text    |  5     | 3
example |  2     | 2
this    |  1     | 1
is      |  1     | 1
an      |  1     | 1

And I thought of solving this using a cross join (I'm on Presto).

WITH 
  ngram_array AS (
  SELECT id, ngrams(split(text, ' ')) ngram_array FROM tbl
 ),
SELECT
  array_join(ngram, ' ') ngram,
  count(*) as counts,
  count(id) as n_ids
FROM ngram_array CROSS JOIN UNNEST (ngram_array) AS t(ngram)
GROUP BY ngram

This seems to be giving me ngrams but the columns counts and n_ids have the same value while I expect a difference because once is the count of the ngram over the entire sample and the second one is the number of documents each ngram is present.

Do you know what I might be doing wrong and is there a fiddle where I can test this online (I know fiddles for Postgres but can't find one for Presto).

CodePudding user response：

You can split text as needed into string array,unnest it and use distinct option for count for ids in group by:

-- sample data
WITH dataset (id, text) AS (
    VALUES (1,   'example text'),
        (2,   'this is an example text'),
        (3,   'text text text')
)

--query
SELECT word,
    count(*) counts,
    count(distinct id) n_ids -- count distinct ids
FROM (
        SELECT id,
            word
        FROM dataset
            CROSS JOIN UNNEST (split(text, ' ')) as t(word)
    )
GROUP BY word
ORDER BY counts desc -- order for output

Output:

word	counts	n_ids
text	5	3
example	2	2
this	1	1
is	1	1
an	1	1