Let's say I have a table in Postgres that stores a column of strings like this.
| animal |
|---|
| cat/dog/bird |
| dog/lion |
| bird/dog |
| dog/cat |
| cat/bird |
What I want to do, is calculate how "correlated" any two animals are to each other in this column, and store that as its own table so that I can easily look up how often "cat" and "dog" show up together.
For example, "cat" shows up a total of 3 times in all of these strings. Of those instances, "dog" shows up in the same string 2 out of the three times. Therefore, the correlation from cat -> dog would be 66%, and the number of co-occurrence instances (we'll call this instance_count) would be 2.
According to the above logic, the resulting table from this example would look like this.
| base_animal | correlated_animal | instance_count | correlation |
|---|---|---|---|
| cat | cat | 3 | 100 |
| cat | dog | 2 | 66 |
| cat | bird | 2 | 66 |
| cat | lion | 0 | 33 |
| dog | dog | 4 | 100 |
| dog | cat | 2 | 50 |
| dog | bird | 2 | 50 |
| dog | lion | 1 | 25 |
| bird | bird | 3 | 100 |
| bird | cat | 2 | 66 |
| bird | dog | 2 | 66 |
| bird | lion | 0 | 0 |
| lion | lion | 1 | 100 |
| lion | cat | 0 | 0 |
| lion | dog | 1 | 100 |
| lion | bird | 0 | 0 |
I've come up with a working solution in Python, but I have no idea how to do this easily in Postgres. Anybody have any ideas?
CodePudding user response:
Idea is to split the data into rows (using unnest(string_to_array())) and then cross-join same to get all permutations.
with data1 as (
select *
from corr_tab), data2 as (
select distinct un as base_animal, x.correlated_animal
from corr_tab, unnest(string_to_array(animal,'/')) un,
(select distinct un as correlated_animal
from corr_tab, unnest(string_to_array(animal,'/')) un) X)
select base_animal, correlated_animal,
(case
when
data2.base_animal = data2.correlated_animal
then
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL)
else
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
and substring(animal,data2.correlated_animal) is not NULL)
end) instance_count,
(case
when
data2.base_animal = data2.correlated_animal
then
100
else
ceil(
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
and substring(animal,data2.correlated_animal) is not NULL) * 100 /
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL) )
end) correlation
from data2
order by base_animal
Refer to fiddle here.
CodePudding user response:
Won't get much simpler or faster than this:
WITH flat AS (
SELECT t.id, a
FROM (SELECT row_number() OVER () AS id, animal FROM tbl) t
, unnest(string_to_array(t.animal, '/')) a
)
, ct AS (SELECT a, count(*) AS ct FROM flat GROUP BY 1)
SELECT a AS base_animal
, b AS corr_animal
, COALESCE(xc.ct, 0) AS instance_count
, COALESCE(round(xc.ct * 100.0 / x.ct), 0) AS correlation
FROM (
SELECT a.a, b.a AS b, a.ct
FROM ct a, ct b
) x
LEFT JOIN (
SELECT f1.a, f2.a AS b, count(*) AS ct
FROM flat f1
JOIN flat f2 USING (id)
GROUP BY 1,2
) xc USING (a,b)
ORDER BY a, instance_count DESC;
db<>fiddle here
Produces your desired result, except for ...
- added consistent sort order
- rounded correctly
Setp-by-step
CTE flat attaches an arbitrary row number as unique id. (If you have a PRIMARY KEY, use that instead and skip the subquery t.) Then unnest animals to get one pet per row (& id).
CTE ct gets the list of distinct animals & their total count.
The outer SELECT builds the complete raster of animal pairs (a / b) in subquery x, plus total count for a. LEFT JOIN to the actual pair count in subquery xc. Two steps are needed to keep pairs that never met in the result. Finally, compute and round the "correlation" smartly. See:
