Home > database >  unigrams in SQL
unigrams in SQL

Time:01-25

I have the following table (tbl)

id | text
1  | example text
2  | this is an example text
3  | text text text

and I would like to return this table as output (unigrams)

ngram   | counts | n_ids
text    |  5     | 3
example |  2     | 2
this    |  1     | 1
is      |  1     | 1
an      |  1     | 1

And I thought of solving this using a cross join (I'm on Presto).

WITH 
  ngram_array AS (
  SELECT id, ngrams(split(text, ' ')) ngram_array FROM tbl
 ),
SELECT
  array_join(ngram, ' ') ngram,
  count(*) as counts,
  count(id) as n_ids
FROM ngram_array CROSS JOIN UNNEST (ngram_array) AS t(ngram)
GROUP BY ngram

This seems to be giving me ngrams but the columns counts and n_ids have the same value while I expect a difference because once is the count of the ngram over the entire sample and the second one is the number of documents each ngram is present.

Do you know what I might be doing wrong and is there a fiddle where I can test this online (I know fiddles for Postgres but can't find one for Presto).

CodePudding user response:

You can split text as needed into string array,unnest it and use distinct option for count for ids in group by:

-- sample data
WITH dataset (id, text) AS (
    VALUES (1,   'example text'),
        (2,   'this is an example text'),
        (3,   'text text text')
)

--query
SELECT word,
    count(*) counts,
    count(distinct id) n_ids -- count distinct ids
FROM (
        SELECT id,
            word
        FROM dataset
            CROSS JOIN UNNEST (split(text, ' ')) as t(word)
    )
GROUP BY word
ORDER BY counts desc -- order for output

Output:

word counts n_ids
text 5 3
example 2 2
this 1 1
is 1 1
an 1 1
  •  Tags:  
  • Related