SQL - Count how many unique IDs would be w/o a category if "category bucket" was removed-CodePudding

I am using Snowflake for this SQL question if there are any unique functions I can use, please help me out!

I have a data set with unique ids, other attributes that aren’t important, and then a list of categories (~22) each Unique id could fall into (denoted by a 1 if it’s in the category and 0 if not.)

I am trying to figure out how to write something where I could see if across all the categories, if a category was removed if any of the unique ids would then be left without any category and count how many unique ids would then do total how many ids would be left category less.

Example below for unique id Jshshsv it is only in CatAA but id Hairbdb is in CatY and CatAA. If CatAA was dropped, how many Ids would be left with no category?

UniqueID	Sum across Categories	CatX	CatY	CatZ	CatAA
Hairbdb	2	0	1	0	1
Jshshsv	1	0	0	0	1

For some reason I just cannot figure out how to do this in a manageable way in sql with so many category buckets. Any tips or things to try would be appreciated.

CodePudding user response：

The follow SQL can be used to find the Catagories are the single match for an ID.

WITH data AS (
    SELECT * FROM VALUES
    ('Hairbdb','CatY'),
    ('Hairbdb','CatAA'),
    ('Jshshsv','CatAA')
    v(id, cat)
), dist_data AS (
    SELECT DISTINCT id, cat FROM data
), cat_counts AS (
    SELECT id, count(distinct cat) c_cat
    FROM data
    GROUP BY 1
    HAVING c_cat = 1
)  
SELECT a.cat, a.id
FROM dist_data AS a
JOIN cat_counts AS b 
    ON b.id = a.id;

This works because you first count per id, how many categories the id is in, then you join the distinct data with those where the id is only in one cat, will give you id & cat

CAT	ID
CatAA	Jshshsv

IF you data is in a wide format (like how you present it), you can turn it into my form via UNPIVOT like so:

WITH data AS (
    SELECT * FROM VALUES
    ('Hairbdb',0,1,0,1),
    ('Jshshsv',0,0,0,1)
    v(id, catx, caty, catz, cataa )
)
SELECT id, cat from data unpivot(catv for cat in (catx, caty, catz, cataa))
WHERE catv = 1;

giving:

ID	CAT
Hairbdb	CATY
Hairbdb	CATAA
Jshshsv	CATAA

But if it's in your form with duplicates removed you could just use a WHERE clause:

WITH data AS (
    SELECT * from values
        ('Hairbdb', 0, 1, 0, 1),
        ('Jshshsv', 0, 0, 0, 1)
     v(UniqueID, CatX,CatY,CatZ, CatAA) 
)
SELECT UniqueID,
    CatX CatY CatZ CatAA as "Sum across Categories",
    CatX,
    CatY,
    CatZ,
    CatAA  
FROM data
WHERE "Sum across Categories" = 1;

So another variation, if you have many rows, per id, and the category allocation is not the same across the set, you can use a COUNT_IF and the greater than 0 test to turn the data into a in any, then using a HAVING clause to filter out those that are are in many columns

WITH data AS (
    SELECT * FROM VALUES
    ('Hairbdb',0,1,0,1),
    ('Hairbdb',1,1,0,1),
    ('Hairbdb',0,1,1,1),
    ('Jshshsv',0,0,0,1),
    ('Jshshsv',0,0,0,1)
    v(id, catx, caty, catz, cataa )
)
SELECT id, 
    COUNT_IF(catx=1)>0 AS catx_a,
    COUNT_IF(caty=1)>0 AS caty_a,
    COUNT_IF(catz=1)>0 AS catz_a,
    COUNT_IF(cataa=1)>0 AS cataa_a
FROM data
GROUP BY 1
HAVING catx_a::int   caty_a::int   catz_a::int   cataa_a::int = 1;

CodePudding user response：

if you are storing the categories in columns (though not a good design) you could try this.

SELECT UniqueID ,  sum(CatX CatY CatZ CatAA)   over (partition by UniqueID) as "Sum across Categories",
 CatX,   CatY,  CatZ, CatAA  FROM (
SELECT 'Hairbdb' as UniqueID,   0 as CatX,  1 as CatY,  0 as CatZ,  1 as CatAA from dual
UNION ALL 
SELECT 'Jshshsv',   0,0,0,1 from dual 
);